Preface


Genetics is the science of genes. One major task of genetics is to construct the
genetic map through linkage analysis, locate the genetic loci of important traits on
the constructed linkage maps, identify favorable alleles that are of value to human
beings, and investigate their biochemical pathways from genotype to phenotype. This
book aims to cover the linkage analysis and gene mapping methodologies which are
applicable to self-pollinated, cross-pollinated, and clonally propagated species, and
to genetic populations derived from two homozygous parents, two heterozygous
parents, and multiple homozygous parents.

Genetically segregating populations are key to any genetic research. Chapter 1
begins with mating designs and various types of genetic populations, followed by the
structure of commonly used populations, collection and preliminary analysis of
genotypic data, collection and ANOVA on phenotypic data, and estimation of
variance components, heritability, and genotypic values. Chapter 2 introduces the
estimation of two-point recombination frequency through linkage analysis in
twenty bi-parental populations. Based on the estimated recombination frequencies,
chapter 3 covers the construction of linkage maps, which can be roughly divided
into two steps, i.e., grouping and ordering. Chapter 4 introduces single marker
analysis and simple interval mapping, without any background control.
Chapters 5 and 6 introduce inclusive composite interval mapping (ICIM) with
background control for additive (and dominant) QTLs, epistatic QTLs, and QTL by
environment interactions. Chapter 7 is on populations derived from two heterozygous
parents, which may be two individuals in a random mating population, two clonal
cultivars, or two single crosses from four homozygous inbred lines. Therefore, the
genetic analysis methods introduced in that chapter are applicable to full-sib
families in random mating species, F1 populations in asexually propagated species,
and double cross F1 populations from four pure-line parents. Chapter 8 focuses on the
pure-line progeny populations derived from four and eight homozygous parents.
Chapter 9 first introduces the analysis methods in selected bi-parental populations,
populations consisting of chromosomal segment substitution lines (CSSL), and
nested association mapping (NAM) populations which are produced by crossing
multiple parents with one common parent. §9.4 outlines the general procedure for
fine mapping, map-based cloning, and functional analysis of quantitative trait genes,
and §9.5 briefly introduces the association mapping methodology applicable in
natural populations. The last chapter gives answers to frequently asked questions in
QTL mapping, which have not been fully investigated in previous chapters.

Most of the content of this book comes from the research activities of
the Quantitative Genetics Group, established in 2005 when the senior author
returned to China after a five-year stay with the International Maize and Wheat
Improvement Center (CIMMYT), headquartered in Mexico. The group is affiliated
with the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences
(CAAS), and works on three major areas: (1) breeding modelling, simulation, and
prediction; (2) genetic analysis of quantitative traits; and (3) genetics and breeding
software and tools development. Appendix A gives a list of journal articles published
by the group, and appendix B gives a list of post-graduate dissertations, which are
relevant to various chapters and sections in this book. Efficient genetic analysis
is not possible without computer software packages. The group has invested
significant effort and resources in developing and upgrading three user-friendly and
stand-alone packages, as listed in appendix C. QTL IciMapping is applicable to
twenty bi-parental populations derived from two homozygous parents, and to NAM
and CSSL populations. GACD is applicable to F1 (or full-sib) families from two
heterozygous parents, and to double cross F1 populations from four homozygous
parents. GAPL is applicable to the pure-line progeny populations from the
inter-mating of three to eight homozygous parents. The three integrated packages
are mentioned wherever they apply to the contents of this book.

Without the substantial financial support received by the group, it would have
been impossible to conduct this research, or to produce this book. The authors are
extremely grateful for the long-term support from
the Generation Challenge Program (https://www.generationcp.org/) and the
HarvestPlus Challenge Program (https://www.harvestplus.org/) of CGIAR
(https://www.cgiar.org/). The authors highly appreciate the research project
funding received from the National 973 Programs of China (2006CB101707,
2014CB138105), and the Natural Science Foundation of China (31271798,
31200917, 31671280, 31861143003). The authors are also grateful for the great
support from the Agricultural Science and Technology Innovation Program of
CAAS.


Professor Jiankang WANG 

Institute of Crop Sciences, Chinese Academy of Agricultural Sciences 
No. 12 Zhongguancun South Street, Beijing 100081 

5 October 2021


Chapter 1
Populations in Genetic Studies 


A population is a set of individuals sharing more or less common characteristics or
properties. In biology, a population can include all living individuals on the earth
as far as the ecological system is concerned. It can also refer to all living
individuals of one biological species, such as populations of human beings, animals,
plants, microorganisms, etc. More often, one biological population consists of
individuals of one species living in specific areas or societies. As far as genetics is
concerned, the population is much smaller: the individuals are more closely
related by co-ancestry or kinship, and therefore share more
common characteristics. A genetic population can be any race of one biological
species, any variety with genetic variation, or the progenies after sexual or asexual
propagation using some individuals as parents. Individuals or lines included in one
genetic population normally have clear relationships or kinship, but also differ
both phenotypically and genetically. For any genetic study, one or sometimes
several populations are needed.

A number of different genotypes have to be included in one genetic population.
Many factors can affect population architecture, such as mating systems, the
number of parental lines, and population size. Developing the most suitable
populations is fundamental to most genetic studies. Population genetics is concerned
with gene frequency and genotypic frequency in genetic populations, how these
frequencies change from the parental generation to the progeny generation when
mating system, mutation, selection, random drift, etc. are taken into consideration,
and what effects these changes have on each population. The number of alleles
together with their frequencies at each locus, and the number of genotypes together
with their frequencies, are major parameters characterizing the population structure
(Wang, 2017; Hartl and Clark, 2007; Hartl and Jones, 2005; Falconer and Mackay, 1996;
Crow and Kimura, 1970). This chapter begins with mating designs and various types of
genetic populations, followed by the structure of commonly used populations,
collection and preliminary analysis of genotypic data, collection and analysis of
variance (ANOVA) on phenotypic data, and estimation of variance components,
heritability, and genotypic values.


1.1 Commonly Used Populations in Genetic Studies


1.1.1 Bi-Parental Populations 


Various mating designs have been proposed and widely used in genetic studies
(Wang, 2017; Bernardo, 2010; Lynch and Walsh, 1998). Populations derived from
two homozygous parental lines (also called pure lines or fixed lines) have been the
most widely used in plant genetic studies since the rediscovery, in 1900, of Mendel's
hybridization experiments in garden peas. The bi-parental mating design begins with
two pure lines showing an obvious difference in one or several phenotypic traits.
Hybridization is made between the two parents (represented by P1 and P2) to generate
their F1 hybrid. Selfing of the F1 hybrid generates the segregating population which
is called F2; hybridization between the F1 hybrid and its two parents generates the
segregating populations which are called P1BC1F1 if backcrossed with P1, and
P2BC1F1 if backcrossed with P2. Selfing and backcrossing may be repeatedly applied
in F2, P1BC1F1, and P2BC1F1 so as to have more advanced generations.
Recombination inbred lines (RILs) are formed after several rounds of repeated selfing.
However, pure lines, which are called doubled haploid (DH) lines, can also be
generated from F1, P1BC1F1, or P2BC1F1 in one generation by DH technology.
Figure 1.1 shows 20 bi-parental populations which are commonly used in genetic
studies in plants, together with chromosomal segment substitution lines (CSSL)
developed by repeated backcrossing and selfing, and the nested association mapping
(NAM) population between several parents and one common parent.


[Figure 1.1 diagram. Parent P1 (AA) and parent P2 (aa) are hybridized to give the
F1 (Aa). The 20 numbered bi-parental populations shown are:
 1. P1BC1F1    2. P2BC1F1    3. F1DH       4. F1RIL
 5. P1BC1RIL   6. P2BC1RIL   7. F2         8. F3
 9. P1BC2F1   10. P2BC2F1   11. P1BC2RIL  12. P2BC2RIL
13. P1BC1F2   14. P2BC1F2   15. P1BC2F2   16. P2BC2F2
17. P1BC1DH   18. P2BC1DH   19. P1BC2DH   20. P2BC2DH
Also shown are CSSLs, developed by further backcrossing with marker-assisted
selection, and the NAM population, made of RIL family 1, RIL family 2, ...,
RIL family n from the crosses P1 x CP, P2 x CP, ..., Pn x CP.
Legend: hybridization; selfing and single seed descent (SSD); repeated selfing and
SSD; doubled haploid (DH) development; marker-assisted selection. RIL: recombination
inbred line; CSSL: chromosomal segment substitution line; CP: common parent.]

FIG. 1.1 – Biparental populations and their derivative relationship in genetic studies in
plants.

At one polymorphic locus (no matter whether it is a marker or a gene), assume
parent P1 carries allele A, parent P2 carries allele a, and the genotypes of the two
parents are AA and aa, respectively. When selection and random drift due to the
limited population size are not considered, the two alleles have equal frequency,
i.e., 0.5, in selfing, repeated selfing, and DH populations starting from the F1
hybrid. Each generation of backcrossing reduces the frequency of the non-recurrent
parent allele by half. Based on the frequency of allele A, denoted f_A, the 20
biparental populations shown in figure 1.1 fall into five classes.


(1) f_A = 0.875. Following two generations of backcrossing with parent P1, the
frequency of allele a is one-quarter of the frequency in F1, i.e., 0.125, and the
frequency of allele A is equal to 0.875 in P1BC2F1. Selfing, repeated selfing, and
DH populations starting from P1BC2F1 have the same gene frequency as that in
P1BC2F1. Therefore, populations P1BC2F1, P1BC2F2, P1BC2RIL and
P1BC2DH shown in figure 1.1 belong to this category.

(2) f_A = 0.75. Following one generation of backcrossing with parent P1, the
frequency of allele a becomes half of the frequency in F1, i.e., 0.25, and the
frequency of allele A is equal to 0.75 in P1BC1F1. Populations P1BC1F1, P1BC1F2,
P1BC1RIL and P1BC1DH shown in figure 1.1 belong to this category.

(3) f_A = 0.5. Populations F2, F3, F1RIL and F1DH shown in figure 1.1 belong to
this category.

(4) f_A = 0.25. Following one generation of backcrossing with parent P2, the
frequency of allele A becomes half of the frequency in F1, i.e., 0.25. Populations
P2BC1F1, P2BC1F2, P2BC1RIL and P2BC1DH shown in figure 1.1 belong to
this category.

(5) f_A = 0.125. Following two generations of backcrossing with parent P2, the
frequency of allele A becomes one-quarter of the frequency in F1, i.e., 0.125.
Populations P2BC2F1, P2BC2F2, P2BC2RIL and P2BC2DH shown in figure 1.1
belong to this category.
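
The five classes follow from simple arithmetic, which can be sketched in a few lines
of Python. This is an illustration only and is not taken from any of the software
packages mentioned in this book; the function name allele_a_frequency is
hypothetical, and the calculation assumes no selection and no random drift.

```python
from fractions import Fraction


def allele_a_frequency(recurrent_parent, n_backcrosses):
    """Expected frequency of allele A (carried by P1) after backcrossing the F1.

    Hypothetical helper: assumes no selection and no random drift.
    n_backcrosses = 0 corresponds to populations derived directly from the F1.
    """
    if recurrent_parent not in ("P1", "P2"):
        raise ValueError("recurrent_parent must be 'P1' or 'P2'")
    # The non-recurrent (donor) allele has frequency 1/2 in the F1 and is
    # halved by each generation of backcrossing.
    donor = Fraction(1, 2) / 2 ** n_backcrosses
    return 1 - donor if recurrent_parent == "P1" else donor


# Selfing, repeated selfing and DH development do not change the expected
# allele frequency, so each row covers the F1, F2, RIL and DH populations
# derived from the same backcross generation.
for parent, n_bc in [("P1", 2), ("P1", 1), ("P1", 0), ("P2", 1), ("P2", 2)]:
    label = "F1-derived" if n_bc == 0 else f"{parent}BC{n_bc}-derived"
    print(label, float(allele_a_frequency(parent, n_bc)))
```

Exact fractions are used so that the values 7/8, 3/4, 1/2, 1/4 and 1/8 are
reproduced without rounding.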


The classification mentioned above is based on gene frequency. There is one other
classification which is based on whether heterozygote Aa is present in the
population. When genotype Aa is present, as in the selfing generation of the F1
hybrid, the three genotypes AA, Aa, and aa segregate in the typical Mendelian ratio
of 1:2:1. The heterozygous genotype cannot be maintained by selfing, and therefore
a population with heterozygotes is called temporary. When heterozygote Aa is absent,
homozygotes AA and aa are the only genotypes in the population, and the genotype
of any individual in the population does not change by selfing. Each individual in the
population can form a pure line or family having exactly the same genotype through
selfing propagation, and therefore the population is called permanent. For the 20
populations shown in figure 1.1, BC2F1, BC2F2, BC1F1, BC1F2, F2, and F3 are
temporary; the others are permanent. One major advantage of using permanent
populations in genetic studies is that multi-environmental and replicated
phenotyping tests can be conducted in the field. Multi-environments mentioned here
can be several locations in one cropping season, several seasons in one location, or
several locations in several seasons. In temporary populations, each individual has a
unique genotype and therefore cannot be repeated by sexual propagation. To reduce
random error in phenotype, selfed families from the heterozygous individuals are
usually developed and used in phenotyping tests, and the mean of each selfed family
is used to represent the phenotype of the heterozygous individual.
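
Why heterozygotes cannot be maintained by selfing can be illustrated with a short
calculation. The sketch below is only an illustration under the same assumptions
as above (one locus, no selection, no random drift); the function name
selfing_genotype_frequencies is hypothetical.

```python
from fractions import Fraction


def selfing_genotype_frequencies(n_selfings):
    """Expected frequencies of (AA, Aa, aa) after selfing the F1 hybrid Aa.

    Hypothetical helper. n_selfings = 1 gives the F2, n_selfings = 2 the F3,
    and so on. Each generation of selfing halves the heterozygote frequency;
    the lost heterozygotes are split equally between AA and aa.
    """
    if n_selfings < 0:
        raise ValueError("n_selfings must be non-negative")
    het = Fraction(1, 2) ** n_selfings
    hom = (1 - het) / 2
    return hom, het, hom


for generation in (1, 2, 6):
    aa_dom, het, aa_rec = selfing_genotype_frequencies(generation)
    print(f"F{generation + 1}: AA={aa_dom}, Aa={het}, aa={aa_rec}")
```

After one generation of selfing the function returns the 1:2:1 ratio of the F2;
with repeated selfing the heterozygote frequency approaches zero, which is the
situation in RIL populations.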

Given on the right side of figure 1.1 is a special kind of permanent population,
which consists of chromosomal segment substitution lines (CSSLs). For most
genetic populations, selection and random drift should be avoided as much as
possible during the population development procedure. The use of single seed
descent (SSD) during the repeated selfing procedure (figure 1.1) can maximize the
effective population size, and significantly reduce the effect of selection and random
drift. To develop a set of CSSLs covering the whole genome of the donor parent,
backcrossing, selfing, and marker-assisted background and forward selection may all
be used. Ideally, each CSSL should carry one donor chromosomal segment, with the
other parts of the genome coming from the recurrent or background parent.
Each CSSL can be viewed as an iso-genic line to its background parent, and
therefore any difference in phenotype can indicate the presence of genes on the donor
chromosomal segment carried by the CSSL. Even though they take a rather long time
to develop, once acquired, CSSLs are valuable genetic materials in confirming,
fine-mapping, and map-based cloning of both qualitative and quantitative genes (see
also §9.2). Combined use of substitution lines carrying a single segment and double
segments also provides the chance to investigate inter-genic interactions or
epistasis. The use of hybrids between CSSLs and their background parent helps to
understand the dominance-related effects and the genetic mechanism of heterosis
(Zhao et al., 2009; Xu et al., 2007; Wang et al., 2006, 2007; Kubo et al., 2002).

High-quality genetic studies are not possible without high-quality and suitable
genetic populations. Permanent populations consist of individuals or lines with fixed
and homozygous genotypes, and have been widely used in genetic studies,
especially in plants. The bi-parental population is derived from a single cross between
two diverse homozygous parents. When the two parents are highly diverse across all
chromosomes in the genome, the bi-parental population can also be used to
construct the genetic linkage map of the species. However, when a large number of
genes are segregating for one phenotypic trait of interest, these genes may interact
with each other, making it difficult to separate the individual genes. As far as
mapping of quantitative trait loci (QTL) is concerned, it is not easy to exclude the
effects of many other QTLs on the current one during one-dimensional scanning, let
alone to map two interacting loci, i.e., epistatic QTLs. Through advanced
backcrossing and selfing together with selection, each CSSL differs from the
background parent only in a small genomic region. Target genes located in this
small region can be further investigated or even cloned by developing
so-called secondary populations, either between two CSSLs or between one CSSL
and the original background parent. When single-segment and double-segment
substitution lines are both available, interactions between genes on two chromosomal
segments can also be investigated. Due to these advantages, a large number of
CSSL populations have been developed in the past 20 years in various crop species,
such as rice, wheat, maize, and soybean.


1.1.2 Multi-Parental Populations


Bi-parental populations are normally derived from single crosses between two
homozygous parents, where just two alleles are to be considered at each polymorphic
locus in genetic analysis. When the two parents carry the same allele at some
genomic regions or loci, only one allele is present in the population.
Non-polymorphic genomic regions do not provide useful information in genetic
analysis. Whether these regions have genetic effects on a trait of interest cannot be
determined. In natural populations, which are normally used in genome-wide
association studies (GWAS), more than two alleles, i.e., multiple alleles, may be
present at one locus. However, linkage disequilibrium is low due to the reduced
relationship in co-ancestry (or kinship), and therefore densely distributed molecular
markers are needed in genotyping to catch the residual disequilibrium between
molecular markers and target genes. In addition, individuals in a natural population
have unequal kinship, and therefore the population structure is complex and in most
cases is largely unknown. It is well known that the admixture of diverse
sub-populations can cause false associations between markers and target genes which
are in fact genetically unlinked. How to use suitable statistical tools to
identify the unknown population structure and remove its effect on genetic studies,
especially in gene detection, is still an open question (Hirschhorn and Daly, 2005).

In the past 10 years, more attention has been focused on mating designs using
multiple parents and the development of multi-parental populations. More and more
genetic studies on multi-parental populations have been conducted and reported. As
shown at the bottom of figure 1.1, NAM is actually one multi-parental mating
design between a set of parents and one common parent. Each family in the NAM
design is a bi-parental population. Taken together, the bi-parental families make up
one NAM population, in which the families are related through the common parent.
The NAM design was first used in maize and published in Science (Buckler et al.,
2009; McMullen et al., 2009). Since then, this design has attracted great attention in
genetic studies and has been applied to many other plant species. By using the NAM
design, it is expected that the advantages of linkage analysis in controlled
populations and of association mapping in natural populations can complement each
other. In the first NAM population developed at Cornell University, 25 maize inbred
lines were used as parents and crossed with another common inbred line. Each F1 was
repeatedly selfed until homozygous fixed lines were formed, resulting in 25
bi-parental families, each having around 200 RILs. As a whole, the population
consists of about 5000 RILs and can be used for GWAS or many other genetic
studies. Separately, each bi-parental family can be used in linkage analysis for
genetic map construction and gene discovery.

When four parents are involved, a double-crossing design can be considered.
From four parents, two single crosses are first made to generate two F1 hybrids.
Then a double cross is made between the two F1 hybrids. The double-crossing design
involves more parents, and therefore introduces more genes and higher genetic
variation in the derivative populations. This design has been occasionally used by
breeders when pyramiding the favorable phenotypes dispersed in multiple varieties or
materials. As with the single cross F1, starting from the double cross F1, a permanent
population consisting of RILs or DHs can also be developed by repeated selfing or
DH technology. Populations derived from double crosses provide abundant variation
for genetic studies and breeding applications. However, as more alleles are
included, the number of possible genotypes at each locus increases significantly.
Genetic analysis therefore becomes more complicated than
the analysis in bi-parental populations.

Clonal species are common among crops and other plants, for example potatoes,
sweet potatoes, cassava, many forest trees, and garden flowers. Clonal propagation
is sometimes called asexual propagation. Under normal conditions, progeny
from a parent in a clonal species does not come from fertilization between female and
male gametes, as is the case in sexual propagation. Instead, clonal progeny comes
from the regeneration and regrowth of vegetative parts or organs of its parent,
such as roots, stems, or sprouts. One individual parent can form a huge clonal
progeny, in which all individual progenies are identical in genetic composition.
Under normal conditions, each clonal line or variety is highly heterozygous. Under
particular conditions, the clonal plants may also go through meiosis and produce
female and male gametes, making it possible to conduct sexual propagation. When
two clonal lines are crossed, their F1 hybrids immediately show genotypic
segregation. Such F1 populations, which are called full-sib families in animals, are
commonly used in genetic studies of clonal and random-mating species. One single cross
between two heterozygous parents has a similar genetic structure to a double cross
from four homozygous parents. This is understandable by taking the two single
crosses that make the double cross as two heterozygous parents. When the linkage
phase, i.e., coupling or repulsion, between linked loci is known, the single cross F1
of two heterozygous parents is equivalent to the double cross F1 of four homozygous
parents (for more details see chapter 7).

Figure 1.2 shows the diagram to make a double cross and its derivative
populations, where P1-P4 represent the four homozygous parents. The pure line parents
can be maize inbred lines developed in hybrid breeding programs or wheat advanced
lines or cultivars developed in conventional breeding programs. In the double cross
design, two single cross hybrids are first made, one between P1 and P2, and the
other between P3 and P4, for example. Individuals in every single-cross hybrid
are heterozygotes and have exactly the same genotype. As a population, every
single-cross hybrid is homogeneous. Taking one hybrid as the female parent and the
other one as the male parent, the second hybridization is made to generate a double
cross F1 hybrid. As the two parents are heterozygotes, genotypic segregation occurs
immediately. When genetic diversity among the four homozygous parents is large
enough, every individual in the double cross F1 population may have a unique
genotype. Therefore, the double cross F1 is heterozygous and heterogeneous (figure 1.2).
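
The segregation in a double cross can be made concrete at a single locus. The
following Python sketch is an illustration only; it assumes that the four parents
P1 to P4 carry four distinct alleles, labelled A, B, C and D here for convenience,
and the function name cross is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(female, male):
    """Expected offspring genotype frequencies at one locus.

    Hypothetical helper: each parent is a pair of alleles, and each parent
    passes either of its two alleles to a gamete with probability 1/2.
    """
    counts = Counter(tuple(sorted(pair)) for pair in product(female, male))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Single crosses between homozygous parents: one genotype only (homogeneous).
single_cross_1 = cross(("A", "A"), ("B", "B"))  # P1 x P2
single_cross_2 = cross(("C", "C"), ("D", "D"))  # P3 x P4
print(single_cross_1, single_cross_2)

# Double cross between the two single-cross hybrids: four genotypes, all
# heterozygous, so the population is heterozygous and heterogeneous.
double_cross = cross(("A", "B"), ("C", "D"))
for genotype, freq in sorted(double_cross.items()):
    print("".join(genotype), freq)
```

With many such loci across the genome, the number of possible multi-locus
genotypes grows quickly, which is why individuals in a double cross F1 may each
have a unique genotype.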

Under special conditions, a cross can be made between two clonal parents in
clonally propagated species. Each clonal parent is heterozygous and homogeneous,
and therefore can be treated as the single cross F1 hybrid between two inbred lines.
Taking two clonal parents as two single cross F1 hybrids, a single cross between two
clonal parents can therefore be treated as a double cross of four pure line parents.
Doubled haploid development or repeated selfing is largely impossible in clonal
species, due to severe inbreeding depression and sterility. Since each progeny
individual can be maintained as a clonal line with a unique genotype, the F1 hybrids
between two heterozygous parents, taken as a population, are suitable for replicated
phenotyping and genetic studies.
For sexually propagated species, the double cross F1 is a temporary population,
which makes replicated phenotyping trials difficult. Similar to
bi-parental populations as shown in figure 1.1, doubled haploid technology and
repeated selfing can be applied to the double cross F1 in order to develop permanent
populations consisting of DH lines and RILs (figure 1.2). The genetic analysis
method for pure-line populations derived from four inbred parents is given in
chapter 8.

FIG. 1.2 – Double cross of four pure line parents and their derivative populations. Notes:
Homogeneous, only one genotype is present in the population; heterogeneous, multiple
genotypes are present in the population; homozygous, two alleles are identical at each locus
for each individual in the population; and heterozygous, two alleles are not identical at each
locus.

When more parents are involved, there are many possible ways to make the
crosses. Four designs are shown in figure 1.3 for 8 inbred parents, which
are called complete diallel, incomplete (or partial) diallel, single chain, and double
chain mating designs, respectively. Different designs require different
numbers of crosses. The complete diallel design makes all pair-wise single
crosses. When the reciprocal crosses are not included, the number of single crosses is
equal to n(n - 1)/2, where n is the number of parents. In figure 1.3A, a total of 28
bi-parental crosses have to be made when n = 8. When there are two groups of
parents, one has a size of n1 and is used as females; the other has a size of n2 and is
used as males. Crossing between the two parental groups is called partial diallel
mating design (figure 1.3B), and the number of crosses to make is equal to n1 x n2.
When n1 = 3 and n2 = 5, a total of 15 bi-parental crosses have to be made. In fact,
the NAM design as shown in figure 1.1 can be viewed as a special case of partial
diallel, i.e., one group has n - 1 parents, and the other one has just one parent.
Single chain design assures that each parent appears once and only once in crossing
(figure 1.3C), and the number of crosses to make is equal to the number of parents.
Double chain design assures that each parent appears twice and only twice in crossing
(figure 1.3D), and the number of crosses to be made is equal to two times the number
of parents.

FIG. 1.3 – Multi-parental mating designs using eight parents as an example. Notes:
(A) complete diallel mating design with no reciprocal crosses; (B) partial diallel mating
design; (C) single chain mating design; (D) double chain mating design.
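
The numbers of crosses required by the four designs can be checked with a short
Python sketch. It is an illustration only; all function names are hypothetical,
and the chain counts simply restate the numbers given in the text (n and 2n)
rather than deriving them from the layout of figure 1.3.

```python
from itertools import combinations


def complete_diallel_crosses(parents):
    """All pair-wise single crosses, reciprocal crosses excluded."""
    return list(combinations(parents, 2))


def partial_diallel_crosses(females, males):
    """Every female parent crossed with every male parent."""
    return [(f, m) for f in females for m in males]


def single_chain_count(n):
    """Number of crosses stated in the text for the single chain design."""
    return n


def double_chain_count(n):
    """Number of crosses stated in the text for the double chain design."""
    return 2 * n


parents = [f"P{i}" for i in range(1, 9)]  # eight parents, as in figure 1.3
n = len(parents)

print("complete diallel:", len(complete_diallel_crosses(parents)))  # n(n - 1)/2
print("partial diallel:", len(partial_diallel_crosses(parents[:3], parents[3:])))
# NAM as a special case of partial diallel: n - 1 parents crossed with one
# common parent (here the last parent plays that role).
print("NAM:", len(partial_diallel_crosses(parents[:-1], parents[-1:])))
print("single chain:", single_chain_count(n))
print("double chain:", double_chain_count(n))
```

Enumerating the crosses, rather than only counting them, also gives the crossing
list that would be needed in practice.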


Genetic study and plant breeding have significantly different objectives, which
also determine what kinds of parents are to be used, and what kinds of populations
are to be developed. For this reason, some populations may be highly suitable for
genetic studies, but may have limited value to breeding, and vice versa. In order to
develop populations with both genetic and breeding values, multi-parental mating
designs have attracted extensive attention in past decades, for example, MAGIC
(multi-parent advanced generation inter-crossing) (Cavanagh et al., 2008; The
Complex Trait Consortium, 2004; Broman et al., 2002). As mentioned earlier,
multi-parental populations provide abundant genetic variation that can be used by
geneticists, but also bring complexity and difficulty to the analysis methods that
can be applied. More factors such as multiple alleles have to be considered;
genotypic values, genetic effects, and variance components are more difficult to
estimate with high accuracy. When more parents are included, it is not always clear
what the best mating design is and what crosses need to be made so as to develop
the most suitable populations for different research objectives. In practice,
population development also depends on the mating system of the studied species, the
convenience of conducting the hybridization, field phenotyping costs, etc.

Chapters 2-6 in this book cover linkage analysis and gene mapping in
bi-parental populations. Chapter 7 focuses on heterogeneous and heterozygous
populations derived from two heterozygous parents or four homozygous parents, and
chapter 8 on heterogeneous and homozygous populations derived from multiple
homozygous parents. Chapter 9 introduces gene mapping in NAM populations, in
selected populations (such as those under selective genotyping) and in CSSLs,
together with the Mendelization of quantitative trait genes and association mapping
with natural populations.
These analysis methods are applicable to self-pollinated, cross-pollinated
and clonally propagated plant species, and to the many kinds of
bi-parental and multi-parental populations which are commonly used in genetic
studies.


1.1.3 Considerations in Developing Genetic Populations 


1. Propose a clear objective to achieve 

Any scientific research or project must have clear objectives; the same is true for
genetic studies. Genetics is the science of genes, and their inheritance and variation.
Major research areas of genetics are the molecular structure and function of genes,
gene function or behavior in cells and living organisms (e.g., dominance and
epigenetics), transmission modes from parents to progenies, and the distribution,
variation, and changes of genes in various populations under various affecting
factors. The Genetics journal
(http://www.genetics.org) once classified its published articles into eight categories,
i.e., genomics, gene expression, cellular genetics, developmental and behavior
genetics, population and evolution genetics, genetics of complex traits, genome, and
systems biology. Different branches of genetics have different objectives and
use different genetic materials and populations. Above all, the first thing to
do is to set a clear objective, followed by the choice of suitable parents and
population development.


2. Develop suitable genetic populations 

A genetic study depends on one or several populations. The development of
suitable populations is a major prerequisite for meaningful genetic
studies, and its first step is to choose suitable parental materials. For
example, suppose a novel genetic material is identified to be resistant to a disease
in a crop. The objectives are (1) to understand the inheritance of the resistance
gene in the resistant material, and (2) to identify molecular markers which are
closely linked with the resistance gene so as to use marker-assisted selection to
transfer the resistance gene to other susceptible materials. To achieve the two
objectives, crosses between resistant and susceptible materials are needed, together
with their derivative progeny populations with genotypic segregation, such as the
bi-parental populations shown in figure 1.1.

Once the segregating populations have been produced, the next tasks are phenotypic 
evaluation of disease resistance and genotypic screening with polymorphic markers for 
every individual or line in those populations. Acting as checks, parental materials are 
always included in both phenotyping and genotyping tests. If a simple Mendelian 
segregation ratio is observed for resistant and susceptible phenotypes in the progeny 
populations, resistance to the disease in the novel material can be explained by one pair 
of genes at one single locus. A qualitative trait determined by a single locus can itself 
be used as a marker. Therefore, the location of the locus can be determined by linkage 
analysis with other molecular markers on the constructed linkage map. Sometimes, disease 
resistance has to be measured quantitatively by an index of infection, and it is not 
always possible to classify the tested progenies as either resistant or susceptible. When 
a Mendelian ratio cannot be observed, resistance in the novel material may be controlled 
by genes at multiple loci. The location of these genes can only be determined by QTL 
(quantitative trait locus or loci) mapping approaches, which are the major contents of 
chapters 4-10 in this book. 

When the parental materials used in crossing differ in disease resistance but not much 
in other traits, as in the case of two isogenic lines, their derived populations are 
suitable for genetic study of disease resistance but may not be suitable for other 
traits. When resistance is qualitative and controlled by only one or two pairs of genes, 
accurate mapping can be achieved by using hundreds of segregating individuals or lines. 
To obtain more reliable results and to validate them, it may be necessary to choose a 
number of susceptible parents and develop several mapping populations. To determine 
whether the new resistance gene is located at a new genetic locus or at the same locus as 
previously identified genes, it is also necessary to cross the novel resistant material 
with other resistant materials, i.e., to conduct a complementary test. By developing and 
utilizing more populations, more detailed knowledge of the novel resistance gene can be 
acquired. 


3. Maximize the effective population size and genetic variations 

Genetic populations differ from breeding populations in many aspects. Genetic 
populations are normally derived from crosses between parents with favorable phenotypes 
and parents with unfavorable phenotypes. Selection and random genetic drift should be 
avoided as much as possible during population development. Breeding populations are 
normally derived from crosses between elite parents that both have favorable phenotypes, 
i.e., good × good to breeders. Intense selection is applied so as to increase the 
frequency of favorable genes and gene combinations. Development of elite progenies 
outperforming both parents through transgressive segregation is a major objective of 
breeding. For genetic studies, populations have to maintain high genetic variation, with 
both favorable and unfavorable alleles present at high frequency, so that the favorable 
alleles can be identified. As shown in figure 1.1, single seed descent (SSD) is widely 
used to develop pure line progenies, which maximizes the effective size of the developed 
population. Taking the F1RIL population as an example, one large F2 population can come 
from one single F1 hybrid plant when enough selfed seeds can be produced. Assuming there 
are 500 segregating individuals in the F2 population, only one selfed seed is harvested 
from each F2 plant. The succeeding F3 generation also has a size of 500, and SSD is 
applied again. This procedure is repeated up to the F7 or F8 generation; 500 RILs are 
retained and then used in phenotyping and genotyping tests. 

One major advantage of using SSD in developing genetic populations is that each RIL can 
be traced back to one single F2 individual, so the effective population size is 
maximized and equal to the size of the F2 population. If the final 500 RILs could be 
traced back to only 100 F2 individuals, the effective size of the 500 RILs would be much 
smaller than 500. In population genetics, effective size determines the effect of random 
drift. A smaller effective size causes greater drift and larger deviations of gene and 
genotypic frequencies from their expectations, making the populations less suitable for 
genetic studies. When the F3 generation is used as the mapping population, each 
individual should be traced back to one single F2 plant, so as to maintain the effective 
population size at 500. Were five seeds harvested from each of 100 F2 individual plants, 
the effective size of the resulting F3 generation of 500 individuals would be greatly 
reduced. 

Because seed germination is incomplete, several seeds may be harvested from each plant 
and sown together in one pot in the field. After germination, only one plant is kept in 
each pot. In this way, the maximum effective population size is maintained. To make sure 
that at least one seed germinates with a given probability, the number of seeds to be 
harvested and sown is given by equation 1.1 (Wang et al., 2004), where P is the 
probability that at least one seed will germinate, and r is the germinating rate. The 
least number of seeds is n rounded up to the next whole number. 

$$n = \frac{\ln(1 - P)}{\ln(1 - r)} \qquad (1.1)$$

Table 1.1 gives the least numbers of seeds under three probability levels and five 
germinating rates. 


TAB. 1.1 - The least number of seeds to be harvested from individual plants to make 
sure at least one plant will survive in the following generation. 

Germinating rate    Probability at least one plant will survive 
                    0.999    0.99    0.95 
0.9                 3        2       2 
0.8                 5        3       2 
0.7                 6        4       3 
0.6                 8        6       4 
0.5                 10       7       5 
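
Equation 1.1 follows from requiring that the probability that none of the n seeds 
germinates, (1 - r)^n, is no larger than 1 - P. The short Python sketch below evaluates 
the equation and rounds up to a whole seed; the function name `seeds_needed` and the small 
rounding tolerance are illustrative assumptions, not part of the cited method. 

```python
import math


def seeds_needed(p_success, germination_rate):
    """Least whole number of seeds n such that 1 - (1 - r)**n >= P (equation 1.1)."""
    if not (0.0 < p_success < 1.0 and 0.0 < germination_rate < 1.0):
        raise ValueError("P and r must both lie strictly between 0 and 1")
    n = math.log(1.0 - p_success) / math.log(1.0 - germination_rate)
    # Round up to a whole seed. The tiny tolerance (an assumption of this sketch)
    # guards against floating-point noise when n is an exact integer.
    return math.ceil(n - 1e-9)


if __name__ == "__main__":
    # Print one row per germinating rate, one column per probability level.
    for r in (0.9, 0.8, 0.7, 0.6, 0.5):
        print(r, [seeds_needed(p, r) for p in (0.999, 0.99, 0.95)])
```

Run as a script, the loop prints one row per germinating rate in the same layout as 
table 1.1, which makes it easy to compare the computed values with the tabulated ones. 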


4. Acquire reliable and accurate genotypic and phenotypic data 

Many kinds of molecular marker technologies have been used in genetic studies. For 
genotypic screening in progeny populations, parental materials need to be tested first 
so as to identify the polymorphic markers. Non-polymorphic markers provide no 
information on crossover and recombination, and therefore cannot be utilized in most 
genetic studies. Individuals or lines in the progeny population are then screened with 
the markers showing polymorphism between the parents. For some technologies, for 
example, genotyping by sequencing (GBS), marker number may not be a major limiting 
factor in genotyping cost. Genotypic screening can be conducted for a large number of 
markers on chips, from which polymorphic markers are determined and utilized in genetic 
studies. 

Most quantitative traits are sensitive to environments and show significant genotype by 
environment interactions. Heritability at the individual plant level is low, which is 
the major reason for using permanent populations. In permanent populations, phenotyping 
can be conducted with replications in multiple environments, where a large number of 
homogeneous individuals of each line are grown and tested. The environment can be a 
location, season, location-by-season combination, sowing date, field management, etc. 
In field trials, a suitable experimental design needs to be applied so as to control and 
reduce the random error associated with phenotypic observations (Kuehl, 2000; Yandel, 
1997). 


1.2 Preliminary Analysis of Genotypic Data 


1.2.1 Collection and Coding of Genotypic Data 


Genotypic data in genetic populations are based on identifiable genetic markers. 
Morphological characters, biochemical products, and molecular bands on gels are some 
examples. Morphological markers refer to phenotypic characters in appearance. Most 
morphological markers are in fact typical qualitative traits controlled by single genes. 
Biochemical markers are biochemical products that can be detected by specific 
biochemical equipment. Molecular markers represent polymorphism in the DNA sequence; 
they are abundant in number and reflect genetic variation at its most fundamental level. 
Genetic markers have mainly been used in constructing genetic linkage maps, locating 
unknown genes affecting phenotypic traits, understanding the linkage relationship 
between genes, conducting marker-assisted selection in breeding, and so on. Genetic 
markers can indicate chromosomal regions and locations. In genetic populations, 
individuals normally have different marker types at different loci. By using marker type 
information, linkage analysis can be conducted to determine the order of markers 
linearly arranged on the chromosome, i.e., the chromosomal locations of markers and 
genes. Some markers may themselves be genes affecting phenotypic traits, for example, 
single-gene qualitative traits and single nucleotide polymorphisms (SNPs) within the 
coding regions of genes. A large number of markers may be neutral with respect to any 
phenotypic trait but linked with functional genes. Two major objectives may be stated 
here when investigating the linkage relationship between markers and genes: 
(1) fine-mapping and cloning of the target genes; (2) indirect selection on target genes 
assisted by the linked markers. 

Only markers showing polymorphism between parents or in genetic populations can be 
useful for genetic studies. For morphological markers, the two parental types may be 
awned versus awnless in plant spikes, dwarf versus tall in plant height, red versus 
white in flower color, etc. For biochemical markers, the two parental types may be 
presence versus absence of some isozymes or chemical products. For molecular markers, 
the two parental types may be presence versus absence of a given DNA sequence, different 
numbers of repeats of a short DNA sequence, or a single nucleotide difference at a given 
site in the DNA sequence, etc. If markers are also treated as traits, they can be 
classified as co-dominant, dominant, and recessive. To conduct the analysis properly, 
coding rules are needed to represent the various kinds of markers and marker types. 
Below are the rules used in the QTL IciMapping integrated software package (Meng et al., 
2015). Both numbers and letters can be used in the software. We first explain the coding 
rule by numbers. 

Take molecular markers as an example. When the DNA segment digested by a specific 
endonuclease enzyme differs in length between homozygous parents P1 and P2, such a 
polymorphism can be observed on electrophoresis gels as identifiable bands at two 
different positions, such as the two solid lines shown in figure 1.4. When both bands 
are present in their F1 hybrid, such polymorphism is called co-dominant, and the three 
types of bands in P1, P2, and F1 represent the three possible marker types or genotypes. 
All three types are present in temporary populations derived from the two parents, 
which are coded by 2, 0, and 1, representing the genotypes of P1, P2, and F1, 
respectively. Only two types are present in permanent populations, which are coded by 
2 and 0, representing the genotypes of P1 and P2, respectively. 


[Figure 1.4: band patterns and their codes. Type of parent P1: 2; type of parent P2: 0; 
type of hybrid F1: 1; possible types in temporary populations: 2, 0, 1; possible types 
in permanent populations: 2, 0.] 


FIG. 1.4 - Co-dominant marker and the coding method in software QTL IciMapping. 


When one band is present in P1 but no band can be observed in P2, such as the bold 
solid band shown in figure 1.5, the band of P1 would be present in F1 as well and is 
therefore called dominant. Two types can be observed in temporary populations 
containing heterozygous individuals, i.e., presence and absence of the band of P1. 
However, individuals showing the band may be homozygous (same as P1) or heterozygous 
(same as F1). Genotypes of these individuals cannot be completely determined without 
using additional generations or populations. Therefore, the dominant band is coded as 
12, representing the two possible genotypes of F1 (coded as 1) and P1 (coded as 2). 
When no band is observed in temporary populations, 0 is given to represent the genotype 
of P2 (figure 1.5). For permanent populations without heterozygotes, presence and 
absence of the band of P1 are coded as 2 and 0 for the two homozygotes, respectively, 
when missing marker types are not included. However, when physical missing does exist, 
both the missing and P2 genotypes have the same marker type, i.e., absence of the 
dominant band, and therefore miscoding may occur. 


[Figure 1.5: band patterns and their codes. Type of parent P1: 2; type of parent P2: 0; 
type of hybrid F1: 1; possible types in temporary populations: 12, 0; possible types in 
permanent populations: 2, 0.] 


FIG. 1.5 - Dominant marker and the coding method in software QTL IciMapping. 


On the other hand, when one band is present in P2 but no band can be observed in P1, 
such as the thin solid band shown in figure 1.6, F1 shows the same band as P2. This can 
be viewed as the band of P1 being absent in F1, and the band as a marker is called 
recessive. Two types can be observed in temporary populations, i.e., presence and 
absence of the recessive band. Individuals showing the band may be heterozygous (same 
as F1) or homozygous (same as P2). Genotypes of these individuals cannot be completely 
determined without additional information. Therefore, the recessive band is coded as 
10, representing the two possible genotypes of F1 (coded as 1) and P2 (coded as 0). 
When no band is observed in temporary populations, 2 is given to represent the genotype 
of P1 (figure 1.6). For permanent populations without heterozygotes, presence and 
absence of the recessive band are coded as 0 and 2 for the two homozygotes, 
respectively. Similar to dominant markers, miscoding may occur when missing types are 
also present, since physical missing cannot be told apart from the absence of the 
recessive band. 


[Figure 1.6: band patterns and their codes. Type of parent P1: 2; type of parent P2: 0; 
type of hybrid F1: 1; possible types in temporary populations: 2, 10; possible types in 
permanent populations: 2, 0.] 


FIG. 1.6 - Recessive marker and the coding method in software QTL IciMapping. 


In figures 1.4-1.6, the three coding values 2, 1, and 0 for genotypes P1, F1, and P2, 
respectively, can be understood as the number of copies of the allele harbored by 
parent P1, or the P1 allele in short, in the three genotypes. Since version 4.0, QTL 
IciMapping also provides coding by single or double characters. The equivalent 
relationship between coding by numbers and by characters is given in table 1.2. For 
co-dominant markers, coding numbers 2, 1, and 0 can be replaced by single characters A, 
H, and B, or double characters AA, AB (or BA), and BB. The three coding rules are 
completely equivalent in the QTL IciMapping software. When double characters are used, 
the order of the single letters does not have any effect. For example, AB and BA have 
the same meaning in the software, both representing the heterozygous genotype in the 
genetic population. For dominant markers, the two genotypes AB and AA cannot be 
separated, and their mixture genotype is coded as number 12 (figure 1.5), or single 
character D, or double characters AH, AX, A*, and A_, etc. (table 1.2). For recessive 
markers, the two genotypes AB and BB cannot be separated, and their mixture genotype is 
coded as number 10 (figure 1.6), or single character R, or double characters BH, BX, 
B*, and B_, etc. (table 1.2). 

In addition to the marker types shown in figures 1.4-1.6, missing and non-parental 
types also exist occasionally. As indicated earlier, missing genotypic values could 
cause misclassification for dominant and recessive markers. Reasons for the occurrence 
of non-parental types include DNA sample confounding, contamination by external pollen, 
natural mutation, etc. (Bernardo, 2010). Non-parental types can hardly be used in 
genetic analysis and are normally treated as missing values as well. In QTL 
IciMapping, missing marker types are represented by number -1, single character X or *, 
or double characters XX or ** (table 1.2). 


TAB. 1.2 - Genotypic coding of molecular markers in software QTL IciMapping. 

Coding method          Genotype 
                       AA    AB        BB    AB + AA             AB + BB             Missing 
By number              2     1         0     12                  10                  -1 
By single character    A     H         B     D                   R                   X, * 
By double characters   AA    AB, BA    BB    AH, HA, AX, XA,     BH, HB, BX, XB,     XX, ** 
                                             A*, *A, A_, _A      B*, *B, B_, _B 

Notes: AA and BB are genotypes of the two homozygous parents P1 and P2; AB + AA means 
that heterozygous genotype AB cannot be separated from homozygous genotype AA (i.e., 
dominant markers); AB + BB means that heterozygous genotype AB cannot be separated from 
homozygous genotype BB (i.e., recessive markers); sometimes, AB + AA is also denoted as 
A* or A_, and AB + BB is denoted as B* or B_, where the symbol * or _ indicates that the 
position can be either allele A or B. 
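
The equivalence between the coding methods can be illustrated by a lookup table that 
converts any accepted code into the numeric coding. The Python sketch below is only an 
illustration of the coding rules described in the text; the names `NUMERIC_CODE` and 
`to_numeric` are hypothetical, the alias lists cover only codes spelled out in this 
section, and the code is not taken from QTL IciMapping itself. 

```python
# Aliases named in the text for each numeric genotype code (illustrative subset).
ALIASES = {
    2: ["2", "A", "AA"],                                  # parent P1 type
    1: ["1", "H", "AB", "BA"],                            # heterozygote
    0: ["0", "B", "BB"],                                  # parent P2 type
    12: ["12", "D", "AH", "HA", "AX", "XA", "A*", "A_"],  # AB or AA (dominant)
    10: ["10", "R", "BH", "HB", "BX", "XB", "B*", "B_"],  # AB or BB (recessive)
    -1: ["-1", "X", "*", "XX", "**"],                     # missing
}

# Invert the table so that every alias points to its numeric code.
NUMERIC_CODE = {alias: number for number, names in ALIASES.items() for alias in names}


def to_numeric(code):
    """Convert one genotype code (number or character string) to the numeric coding."""
    key = str(code).strip()
    if key not in NUMERIC_CODE:
        raise ValueError(f"invalid genotype code: {code!r}")
    return NUMERIC_CODE[key]


if __name__ == "__main__":
    # A mixed-coded marker row is mapped onto a single, consistent numeric coding.
    print([to_numeric(c) for c in ["A", "AB", "BA", 0, "D", "**", -1]])
```

Converting every input to one coding before analysis is a simple way to keep a data 
file consistent, in line with the recommendation to use either numbers or characters 
within one population. 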

The coding of missing phenotypes is also briefly mentioned here. Phenotypic values are 
normally represented by real numbers. In QTL IciMapping, missing phenotypic values are 
represented by the symbols NA, na, *, or the full stop sign, which have been adopted by 
various genetic analysis software packages. The value -100.00 was used to stand for 
missing phenotypes in earlier versions of QTL IciMapping. This coding method is still 
applicable in version 4.0 and higher. However, within one population, missing 
phenotypes must be coded either by -100.00 or by the character symbols mentioned above; 
mixed coding of missing phenotypes is not allowed. 

Co-dominant, dominant, and recessive markers may exist simultaneously in one 
specific population. However, one specific marker belongs to one and only one of the 
three categories. Taking number coding as an example, when the three numbers 2, 1, 
and 0 exist simultaneously at one marker locus in one population, this marker is 
defined as co-dominant by the software QTL IciMapping; when number 12 exists, 
this marker is defined as dominant; when number 10 exists, this marker is defined as 
recessive. In other words, four potential genotypes at one co-dominant marker locus 
are AA, AB, BB, and missing; three potential genotypes at one dominant marker 
locus are AA + AB, BB, and missing; three potential genotypes at one recessive 
marker locus are AA, AB + BB, and missing. When two numbers 2 and 12 both 
appear at one specific marker locus, the software will treat the input information as 
invalid, and the program will stop. Like many other software packages, QTL 
IciMapping has special requirements on the format of input data so as to define the 
genetic population properly and then load the population to the software for 
analysis. The software checks for any invalid values on genotype and phenotype in 
the population to be analyzed. When invalid values are identified in input files, the 
program will stop but output relevant error messages, which may help the users to 
identify where the invalid values may occur in input files. Users have to correct the 
invalid values and reload the population to the software. 
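
The rule for deciding the category of a marker from its numeric codes can be sketched 
as follows. This is a minimal illustration of the rules stated above, not the actual 
checking routine of QTL IciMapping; the function name `classify_marker`, the treatment 
of 0 together with 10 as invalid (by symmetry with 2 together with 12), and the default 
category for loci without code 12 or 10 are assumptions of this sketch. 

```python
def classify_marker(codes):
    """Classify one marker locus from its numeric genotype codes (2, 1, 0, 12, 10, -1)."""
    observed = set(codes) - {-1}            # missing values carry no information
    invalid = observed - {2, 1, 0, 12, 10}
    if invalid:
        raise ValueError(f"invalid genotype codes: {sorted(invalid)}")
    if 12 in observed and 10 in observed:
        # A marker belongs to one and only one category.
        raise ValueError("codes 12 and 10 cannot both appear at one marker")
    if 12 in observed:
        if 2 in observed:
            # Stated in the text: 2 and 12 together are treated as invalid input.
            raise ValueError("codes 2 and 12 cannot both appear at one marker")
        return "dominant"
    if 10 in observed:
        if 0 in observed:
            # Assumption of this sketch, by symmetry with the dominant case.
            raise ValueError("codes 0 and 10 cannot both appear at one marker")
        return "recessive"
    # Assumption: loci coded only with 2, 1, and 0 are treated as co-dominant.
    return "co-dominant"


if __name__ == "__main__":
    print(classify_marker([2, 1, 0, -1, 1]))   # co-dominant
    print(classify_marker([12, 0, 12, -1]))    # dominant
    print(classify_marker([2, 10, 10]))        # recessive
```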


As an example, figure 1.7 shows genotypes at 14 marker loci in a bi-parental barley 
population consisting of 145 doubled haploid (DH) lines (Tinker et al., 1996). The two 
parents of the barley DH population are cultivars Harrington and TR306. Numbers are 
given to marker genotypes, where 2 stands for the Harrington parental type, 0 for the 
TR306 parental type, and -1 for missing values. Since this is a permanent population of 
pure lines, numbers 1, 10, and 12 cannot be seen in figure 1.7. As another example, 
figure 1.8 shows genotypes at 12 marker loci in one bi-parental F2 population 
consisting of 110 individuals. Single characters are given to marker genotypes, where A 
and B stand for the two parental types, H for heterozygous genotype, D for dominant 
genotype, R for recessive genotype, and X for missing values. From figure 1.8, it can 
be seen that M1-3 is a dominant marker having three valid values D, B, and X; M1-6 is a 
recessive marker having three valid values A, R, and X; the others are co-dominant 
markers having four valid values A, H, B, and X. In chapters 2-6 of this book, the two 
populations given in figures 1.7 and 1.8 are used as examples as much as possible. 

The software QTL IciMapping allows mixed coding on genotypes. For example, in 
figure 1.7, DH lines 1 and 3 were coded as 0 and -1, respectively, at marker Act8A. 
Replacing 0 with B or BB, and -1 with X, XX, * or ** will not make any difference to 
the program and analysis. In figure 1.8, individuals 1 and 2 are coded as X and H, 
respectively, at marker M1-1. Replacing X with -1, XX, * or **, and H with 1, AB or BA 
will not make any difference either. However, to be consistent and clear, mixed coding 
on marker types is not encouraged. For one specific population, either numbers or 
characters, but not both, should be consistently adopted. 


[Figure 1.7: genotype grid with axes "Marker name" and "DH lines". Legend: 2, parent 
Harrington; 0, parent TR306; -1, missing.] 


FIG. 1.7 - Genotypic data of 14 molecular markers in a barley population consisting of 
145 doubled haploid (DH) lines. Notes: 2 stands for the Harrington parental type, 0 for 
the TR306 parental type, and -1 for the missing value. 


A: genotype AA; H: genotype AB; B: genotype BB; D: genotype AB and AA; R: genotype AB and BB; X: missing value 
[Figure 1.8: genotype grid of single-character codes. Rows: marker names M1-1 to M1-12; 
columns: individuals 1 to 110 in the F2 population, shown in three blocks (1-37, 38-74, 
and 75-110). Marker M1-3 is coded with D and B, marker M1-6 with A and R, and the other 
markers with A, H, and B; X marks missing values.] 

FIG. 1.8 - Genotypic data of 12 molecular markers in an F2 population consisting of 110 
individuals. Notes: A and B stand for two parental types, H for heterozygous genotype, D for 
dominant genotype, R for recessive genotype, and X for missing values. 


1.2.2 Gene Frequency and Genotypic Frequency 


Gene (or allele) frequency and genotypic frequency at one particular locus or a set of 
loci are the two most important parameters characterizing genetic populations. Genetic 
loci are specific chromosomal locations that may harbor a number of alleles responsible 
for phenotypic differences, or molecular-level polymorphisms with no obvious biological 
functions. Assume that at one locus there are two alleles represented by A and a, that 
one diploid population consists of n individuals (n is called the population size), and 
that n_AA, n_Aa, and n_aa are the observed sample sizes of the three genotypes AA, Aa, 
and aa in the population, so that n = n_AA + n_Aa + n_aa. In diploid species, each 
individual carries two genes at the locus, resulting in a total number of 2n genes. 
Individuals of genotype AA each carry two alleles of A, those of genotype aa each carry 
two alleles of a, and those of genotype Aa each carry one A allele and one a allele. 
Therefore, frequencies of the two alleles can be estimated by equation 1.2. For 
convenience, frequencies of the three genotypes AA, Aa, and aa are given in 
equation 1.3. 


$$p_A = \frac{2n_{AA} + n_{Aa}}{2n} = \frac{n_{AA} + \frac{1}{2}n_{Aa}}{n}, \quad p_a = \frac{n_{Aa} + 2n_{aa}}{2n} = \frac{\frac{1}{2}n_{Aa} + n_{aa}}{n} \qquad (1.2)$$

$$f_{AA} = \frac{n_{AA}}{n}, \quad f_{Aa} = \frac{n_{Aa}}{n}, \quad f_{aa} = \frac{n_{aa}}{n} \qquad (1.3)$$
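
As a worked illustration of counting alleles, the Python sketch below estimates allele 
and genotypic frequencies from the three genotype counts. The function name 
`allele_and_genotype_frequencies` is hypothetical, and the counts 575, 1183, and 583 are 
those of the co-dominant marker in the wheat F2 population discussed in section 1.2.3. 

```python
def allele_and_genotype_frequencies(n_AA, n_Aa, n_aa):
    """Return (p_A, p_a, genotype frequencies) from observed genotype counts."""
    n = n_AA + n_Aa + n_aa                 # population size
    if n <= 0:
        raise ValueError("at least one individual is required")
    # Each diploid individual carries two genes, so there are 2n genes in total.
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    p_a = (n_Aa + 2 * n_aa) / (2 * n)
    genotype_freq = {"AA": n_AA / n, "Aa": n_Aa / n, "aa": n_aa / n}
    return p_A, p_a, genotype_freq


if __name__ == "__main__":
    p_A, p_a, freq = allele_and_genotype_frequencies(575, 1183, 583)
    print(f"p_A = {p_A:.4f}, p_a = {p_a:.4f}")
    print({g: round(f, 4) for g, f in freq.items()})
```

The two allele frequencies always sum to one, which provides a quick check on the 
counts entered. 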


1.2.3 Fitness Test on Genotypic Frequencies 


In genetic populations where the derivation procedures are known, such as the many 
bi-parental populations shown in figure 1.1, gene frequency and genotypic frequency 
have expected values when selection can be ignored. That is to say, different genotypes 
satisfy some expected ratios or Mendelian ratios. For example, the expected ratio of 
the three genotypes AA, Aa, and aa is 1:2:1 in the F2 population. For this reason, the 
fitness test on genotypic frequencies in genetics is also called the Mendelian ratio 
test. An accepted fit between observed and expected frequencies indicates that no 
selection is detected at the tested locus, or that the inheritance of the tested locus 
can be explained by Mendelian laws. On the other hand, a rejected fit indicates that 
selection may have occurred, or that segregation of the tested locus is distorted from 
the tested Mendelian ratio. Therefore, the fitness test on genotypic frequencies is 
also called the segregation distortion test. 

Fitness of the observed genotypic frequencies in one population can be tested by the 
χ² statistic given in equation 1.4, where O is the observed sample size of each 
genotype, E is the expected sample size calculated from the expected Mendelian ratio, 
and the summation is across all genotypic groups. The degree of freedom of the χ² test 
is equal to the number of genotypic groups in the tested population minus one. 


$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (1.4)$$

Given in table 1.3 is an F2 population in wheat, where the cross (i.e., F1) was made 
between one disease-resistant (i.e., P1) and one disease-susceptible (i.e., P2) wheat 
variety. In screening for molecular markers genetically linked with the resistance 
gene, one co-dominant marker was identified. Out of 2341 individuals in the F2 
generation, 575, 1183, and 583 were identified to have the same genotypes as P1, F1, 
and P2, respectively. From equation 1.2, the observed frequencies of the two parental 
alleles are 0.4983 and 0.5017, respectively, close to the expected frequency of 0.5. If 
no distortion or other factors affected the population structure, the three marker 
types should follow the Mendelian ratio 1:2:1, based on which the expected sample sizes 
can be calculated (table 1.3). Expected and observed sample sizes are not exactly 
equal. From equation 1.4, the χ² statistic is calculated to be 0.3217, and its 
significance probability at 2 degrees of freedom is P = 0.8514, indicating that the 
difference between observed and expected frequencies is not significant at the 0.05 
significance level. In other words, the deviation of the observed frequencies from 
their expected values can be attributed entirely to chance, and there is no evidence of 
distortion or selection at the marker locus. For the disease trait, only two phenotypes 
can be observed. In the F2 population, 1747 individuals are identified to be resistant, 
and 594 are susceptible to the disease. The fitness test against the expected ratio of 
3:1 resulted in a significance probability of P = 0.6762, which is consistent with 
disease resistance being controlled by one dominant gene. 


TAB. 1.3 - Test of fitness for one co-dominant marker and one dominant disease gene in 
an F2 population in wheat. 

Sample size and test statistic   Ratio 1:2:1 of co-dominant marker   Ratio 3:1 of resistant gene 
                                 Type 2    Type 1     Type 0         Resistant    Susceptible 
Observed sample size             575       1183       583            1747         594 
Expected sample size             585.25    1170.50    585.25         1755.75      585.25 
(O - E)²/E                       0.1795    0.1335     0.0087         0.0436       0.1308 
χ² = Σ (O - E)²/E                0.3217                              0.1744 
Significance P                   0.8514                              0.6762 
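
The calculation behind table 1.3 can be reproduced with a few lines of Python. The 
sketch below uses only the standard library; the function names `chi_square_sf` and 
`fitness_test` are hypothetical, and the significance probability is obtained from the 
closed-form upper tail of the χ² distribution for integer degrees of freedom rather 
than from a statistics package. 

```python
import math


def chi_square_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom >= x)."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    y = x / 2.0
    # Start from df = 2 (even case) or df = 1 (odd case), then step up by 2
    # using Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def fitness_test(observed, ratio):
    """Chi-square test of observed counts against an expected ratio (equation 1.4)."""
    if len(observed) != len(ratio) or any(r <= 0 for r in ratio):
        raise ValueError("observed and ratio must match, with positive ratio terms")
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1          # number of genotypic groups minus one
    return chi2, chi_square_sf(chi2, df)


if __name__ == "__main__":
    for counts, ratio in (([575, 1183, 583], [1, 2, 1]), ([1747, 594], [3, 1])):
        chi2, p = fitness_test(counts, ratio)
        print(counts, ratio, f"chi2 = {chi2:.4f}, P = {p:.4f}")
```

The same function applies to the 1:1 ratio expected in DH and RIL populations by 
passing two counts and the ratio [1, 1], with missing marker types excluded from the 
counts. 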


Take marker Act8A in figure 1.7 as another example. At this locus, n_AA = 74, 
n_Aa = 0, n_aa = 70, and n = n_AA + n_Aa + n_aa = 144. The marker type is missing for 
the 3rd DH line. Allele A is harbored in parent Harrington, whose genotype is coded by 
the number 2. Allele a is harbored in parent TR306, whose genotype is coded by the 
number 0. The observed frequencies of the two alleles are 0.5139 and 0.4861. Since the 
heterozygous genotype does not exist in DH populations, genotypic frequency is the same 
as gene frequency. When distortion does not occur, the two genotypes will follow the 
expected ratio of 1:1. The fitness test gives a significance probability of P = 0.7389 
(table 1.4), indicating that the genotypic frequencies observed at this marker fit 
their expected segregation ratio of 1:1. Table 1.4 gives the fitness test on all 14 
markers listed in figure 1.7. No significant distortion is detected at any of them in 
the barley DH population. 


TAB. 1.4 - Test of fitness for 14 molecular markers in one barley DH population. 

Marker name   Sample size                  Frequency of the allele   χ²       Significance 
              Type 2   Type 0   Missing    in parent Harrington               probability 
Act8A         74       70       1          0.5139                    0.1111   0.7389 
OP06          72       69       4          0.5106                    0.0638   0.8005 
aHor2         75       61       9          0.5515                    1.4412   0.2299 
MWG943        76       61       8          0.5547                    1.6423   0.2000 
ABG464        77       63       5          0.5500                    1.4000   0.2367 
Dor3          73       68       4          0.5177                    0.1773   0.6737 
iPgd2         74       71       0          0.5103                    0.0621   0.8033 
cMWG733A      76       67       2          0.5315                    0.5664   0.4517 
AtpbA         74       69       2          0.5175                    0.1748   0.6759 
drung         73       72       0          0.5034                    0.0069   0.9338 
ABC261        72       72       1          0.5000                    0.0000   1.0000 
ABG710B       70       73       2          0.4895                    0.0629   0.8019 
Aga7          70       75       0          0.4828                    0.1724   0.6780 
MWG912        70       71       4          0.4965                    0.0071   0.9329 
For convenience, table 1.5 gives the expected frequency of the allele in parent P₁,
and the expected frequencies of three genotypes AA, Aa and aa in 20 bi-parental
populations as shown in figure 1.1. If allele A is dominant to allele a, genotypes AA
and Aa cannot be separated. In this case, the expected frequency of mixed genotype
AA + Aa (or A_) can be acquired by summation of two frequencies of AA and Aa.
If allele A is recessive to allele a, genotypes Aa and aa cannot be separated. In this 
case, the expected frequency of mixed genotype Aa + aa (or a_) can be acquired by 
summation of two frequencies of Aa and aa. It is worth mentioning that mixed 
genotype A_ or a_ can be further determined by progeny testing, e.g., the phe- 
notypic segregation in their selfed F families (see exercise 1.9 for details). 


TAB. 1.5 — Expected (or theoretical) gene and genotypic frequencies in 20 bi-parental 
populations. 


Population   Population   Frequency of allele   Expected (or theoretical) frequencies
number       name         A in parent P₁        Genotype AA   Genotype Aa   Genotype aa
1            P₁BC₁F₁      0.75                  0.5           0.5           0
2            P₂BC₁F₁      0.25                  0             0.5           0.5
3            F₁DH         0.5                   0.5           0             0.5
4            F₁RIL        0.5                   0.5           0             0.5
5            P₁BC₁RIL     0.75                  0.75          0             0.25
6            P₂BC₁RIL     0.25                  0.25          0             0.75
7            F₂           0.5                   0.25          0.5           0.25
8            F₃           0.5                   0.375         0.25          0.375
9            P₁BC₂F₁      0.875                 0.75          0.25          0
10           P₂BC₂F₁      0.125                 0             0.25          0.75
11           P₁BC₂RIL     0.875                 0.875         0             0.125
12           P₂BC₂RIL     0.125                 0.125         0             0.875
13           P₁BC₁F₂      0.75                  0.625         0.25          0.125
14           P₂BC₁F₂      0.25                  0.125         0.25          0.625
15           P₁BC₂F₂      0.875                 0.8125        0.125         0.0625
16           P₂BC₂F₂      0.125                 0.0625        0.125         0.8125
17           P₁BC₁DH      0.75                  0.75          0             0.25
18           P₂BC₁DH      0.25                  0.25          0             0.75
19           P₁BC₂DH      0.875                 0.875         0             0.125
20           P₂BC₂DH      0.125                 0.125         0             0.875
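The expected frequencies in table 1.5 follow from three elementary one-locus operations applied to the F₁ hybrid: selfing, backcrossing, and chromosome doubling of gametes. The sketch below illustrates this under the assumption of no selection or distortion; the function names are hypothetical, and at a single locus repeated selfing to RILs gives the same expected frequencies as DH.

```python
from fractions import Fraction as Fr

def selfing(freq):
    """One generation of selfing: each Aa gives 1/4 AA, 1/2 Aa, 1/4 aa."""
    AA, Aa, aa = freq
    return (AA + Aa / 4, Aa / 2, aa + Aa / 4)

def backcross(freq, parent):
    """Cross the population to P1 (AA) or P2 (aa)."""
    AA, Aa, aa = freq
    p = AA + Aa / 2                      # frequency of gametes carrying A
    if parent == "P1":
        return (p, 1 - p, Fr(0))
    return (Fr(0), p, 1 - p)

def doubled_haploid(freq):
    """Doubling gametes gives only homozygotes; one-locus RILs have the same expectation."""
    AA, Aa, aa = freq
    p = AA + Aa / 2
    return (p, Fr(0), 1 - p)

F1 = (Fr(0), Fr(1), Fr(0))
populations = {
    "F2": selfing(F1),
    "F3": selfing(selfing(F1)),
    "P1BC1F1": backcross(F1, "P1"),
    "P1BC2F1": backcross(backcross(F1, "P1"), "P1"),
    "P1BC2F2": selfing(backcross(backcross(F1, "P1"), "P1")),
    "P1BC1DH": doubled_haploid(backcross(F1, "P1")),
}
for name, freq in populations.items():
    print(name, [float(f) for f in freq])
```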


1.3 Genetic Effect and Genetic Variance 


1.3.1 Calculation of Population Mean and Phenotypic 
Variance 


Effects of genes on a trait of interest can be observed or measured from the 
phenotypic performance of individuals or families with different genotypes. 
Genetic effects are normally unknown parameters to be estimated from phenotypic
observations. Assume that one genotype has a mean performance of $\mu$ in one given
environment, and that the variance of random error included in phenotypic observations
is $\sigma_\varepsilon^2$; both parameters are unknown and need to be estimated. In most
situations, random error follows the normal or Gaussian distribution with a mean of 0.
Therefore, the phenotypic performance of one specific genotype in one specific
environment follows a normal distribution with a mean of $\mu$ and variance of
$\sigma_\varepsilon^2$. Assume there are r independently replicated observations. These
observations can be treated as a sample from a statistical population consisting
of all possible phenotypic values of the genotype. Therefore, observations can be
represented by a distribution model in equation 1.5 or a linear model in
equation 1.6.


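$$y_k \sim N(\mu, \sigma_\varepsilon^2) \quad (k = 1, 2, 3, \ldots, r), \text{ independent and identical distribution (iid)} \tag{1.5}$$


$$y_k = \mu + \varepsilon_k, \text{ where } \varepsilon_k \sim N(0, \sigma_\varepsilon^2) \quad (k = 1, 2, \ldots, r) \text{ iid} \tag{1.6}$$


In statistics, the sample mean is defined by $\bar{y}_{.} = \frac{1}{r}\sum_{k} y_k$, and the
distribution of the sample mean is $\bar{y}_{.} \sim N(\mu, \frac{1}{r}\sigma_\varepsilon^2)$.
Obviously, the sample mean $\bar{y}_{.}$ is an unbiased estimate of the unknown phenotypic
mean $\mu$, as given in equation 1.7.

$$\hat{\mu} = \bar{y}_{.}, \text{ and } E(\bar{y}_{.}) = \mu \tag{1.7}$$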


Deviation of one observation to the sample mean, represented by $y_k - \bar{y}_{.}$, measures
how far away the observation is from the sample mean, which in a certain sense reflects the
error effect included in the observation, i.e., $y_k - \mu$. Deviation $y_k - \bar{y}_{.}$ is
sometimes called the estimate of error effect $y_k - \mu$. Let $SS_\varepsilon$ be the sum of
squares of all deviations, i.e., $SS_\varepsilon = \sum_k (y_k - \bar{y}_{.})^2$. In statistical
analysis of variance (ANOVA), $SS_\varepsilon$ is also called the sum of squares of errors or
the sum of squares of residuals, which can be directly calculated from observations. Taking
observations as random variables, $SS_\varepsilon$ is also a random variable. The expectation
or mean of the random variable $SS_\varepsilon$ can be proved to have the following relationship
with the error variance $\sigma_\varepsilon^2$, which needs to be estimated.


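$$
\begin{aligned}
E(SS_\varepsilon) &= E\sum_k (y_k - \bar{y}_{.})^2 = E\sum_k \left[(y_k - \mu) - (\bar{y}_{.} - \mu)\right]^2 \\
&= \sum_k E(y_k - \mu)^2 - 2E\left[(\bar{y}_{.} - \mu)\sum_k (y_k - \mu)\right] + rE(\bar{y}_{.} - \mu)^2 \\
&= \sum_k E(y_k - \mu)^2 - 2E\left[(\bar{y}_{.} - \mu)\, r(\bar{y}_{.} - \mu)\right] + rE(\bar{y}_{.} - \mu)^2 \\
&= r\sigma_\varepsilon^2 - rE(\bar{y}_{.} - \mu)^2 = r\sigma_\varepsilon^2 - r \cdot \frac{\sigma_\varepsilon^2}{r} = (r - 1)\sigma_\varepsilon^2
\end{aligned}
$$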
where the coefficient (r − 1) ahead of $\sigma_\varepsilon^2$ is called the degree of freedom
of errors. As there is one restriction over the r estimated error terms, i.e., the sum of all
deviations $y_k - \bar{y}_{.}$ is equal to 0, only r − 1 of them are independent. Thus, the
error degree of freedom can be treated as the number of independent error terms. The sum of
squares $SS_\varepsilon$ divided by its degree of freedom is called the mean square of errors,
denoted by $MS_\varepsilon$. Therefore, the mean square $MS_\varepsilon$ gives an unbiased
estimate of the unknown error variance $\sigma_\varepsilon^2$, as represented in equation 1.8.


$$\hat{\sigma}_\varepsilon^2 = MS_\varepsilon, \text{ where } MS_\varepsilon = \frac{SS_\varepsilon}{r - 1} \text{ and } E(MS_\varepsilon) = \sigma_\varepsilon^2 \tag{1.8}$$

In equation 1.5, two parameters associated with a normal distribution were 
introduced to characterize the phenotypic distribution of one genotype in one 
specific environment. Up to now, their unbiased estimates have been derived and 
shown in equations 1.7 and 1.8. In statistics, it can be further proved that the 
estimate given by equation 1.7 has the minimum variance among all unbiased linear 
functions of the r observations, and is therefore called the best linear unbiased 
estimate (BLUE). 

Table 1.6 gives plant height in two rice parental lines, and their F₁ and F₂
populations. Ten individual plants were measured in each of the two parental and the F₁
hybrid populations; 30 individuals were measured in the F₂ population. From
equation 1.7, it can be easily found that the estimate of the mean of plant height
was 160.40 cm for the tall parent, 103.00 cm for the short parent, 148.80 cm for
their F₁, and 139.73 cm for their F₂. Estimates of plant height mean in F₁ and F₂
were located between the two parental populations. From equation 1.8, it can be found
that the estimate of population variance was 24.71 cm² for the tall parent, 34.89 cm²
for the short parent, 20.40 cm² for their F₁, and 692.13 cm² for their F₂. As there is
only one genotype included in each of the two parental and the F₁ hybrid populations,
variances in the three populations were caused totally by random errors.


TAB. 1.6 – Observations of plant height in two rice parental lines and their F₁ and F₂
populations.


Population     Observations of plant height (cm)              Sample mean   Sample variance
                                                              (cm)          (cm²)
Tall parent    155, 161, 150, 164, 165, 161, 160, 158,        160.40        24.71
               166, 164
Short parent   97, 109, 92, 103, 109, 104, 98, 106, 102,      103.00        34.89
               110
F₁             156, 148, 140, 150, 148, 147, 146, 155,        148.80        20.40
               148, 150
F₂             89, 157, 149, 169, 123, 158, 151, 83, 167,     139.73        692.13
               154, 152, 167, 116, 146, 97, 147, 162, 159,
               111, 143, 144, 124, 137, 156, 80, 169, 157,
               152, 157, 116
Estimates of the three population variances were similar, which can be combined to
estimate the error variance. Assuming that the two parents carry different alleles for
plant height, multiple genotypes occur in their F₂ population. In addition to random
error, the difference in genotypes also causes variation in plant height, i.e.,
genetic variance. It can be seen from table 1.6 that the estimated variance in the F₂
population was much higher than estimates in the other three populations, indicating
the existence of significant genetic variance in plant height in the F₂
population.
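The sample means and variances in table 1.6 can be checked directly from the observations by equations 1.7 and 1.8. The sketch below uses Python's standard `statistics` module, whose `variance` function divides by r − 1 as in equation 1.8. The pooled error variance, weighted by degrees of freedom, is one simple way to combine the three single-genotype populations; it is shown here as an illustration rather than as the book's own procedure.

```python
from statistics import mean, variance

heights = {
    "tall": [155, 161, 150, 164, 165, 161, 160, 158, 166, 164],
    "short": [97, 109, 92, 103, 109, 104, 98, 106, 102, 110],
    "F1": [156, 148, 140, 150, 148, 147, 146, 155, 148, 150],
    "F2": [89, 157, 149, 169, 123, 158, 151, 83, 167,
           154, 152, 167, 116, 146, 97, 147, 162, 159,
           111, 143, 144, 124, 137, 156, 80, 169, 157,
           152, 157, 116],
}

# Equation 1.7 (sample mean) and equation 1.8 (mean square of errors)
summary = {name: (mean(obs), variance(obs)) for name, obs in heights.items()}

# Pool the three single-genotype populations, weighting by degrees of freedom
single = ("tall", "short", "F1")
pooled_ss = sum((len(heights[k]) - 1) * summary[k][1] for k in single)
pooled_df = sum(len(heights[k]) - 1 for k in single)
pooled_error_variance = pooled_ss / pooled_df

for name, (m, v) in summary.items():
    print(f"{name}: mean = {m:.2f} cm, variance = {v:.2f} cm^2")
print(f"pooled error variance = {pooled_error_variance:.2f} cm^2")
```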


1.3.2 One-Locus Additive and Dominance Model 


In one population consisting of one single genotype, variation in the population is
solely ascribed to random errors. Variances in such populations, such as inbred
parents and their F₁, can therefore be used to estimate error variance, as has
been seen previously in table 1.6. In genetic studies, however, genetic variance is of
great concern, together with genotypic values and the effects of genes. In addition,
the importance of genetic variance can be expressed by its proportion to phenotypic
variance, which is normally called heritability in the broad sense, one paramount
concept and parameter in quantitative genetics. In the following two sections, the
most fundamental one-locus model will be used as an example to explain the genetic
effect and genetic variance. Calculation of heritability will be covered in §1.6.

Assume there are two alleles, i.e., A and a, located at one locus, and that $\mu_{AA}$,
$\mu_{aa}$, and $\mu_{Aa}$ are population or genotypic means of parent P₁ (AA), parent P₂
(aa), and their F₁ (Aa), respectively, on a trait of interest. The assignment of the two
parents as P₁ and P₂ is totally arbitrary. Let m be the mid-parental value, i.e.,
$m = \frac{1}{2}(\mu_{AA} + \mu_{aa})$. Distance of both parents to the mid-parent is defined
as the additive effect of the two alleles at the locus, represented by
$a = \frac{1}{2}(\mu_{AA} - \mu_{aa})$. The deviation of F₁ to the mid-parent is defined
as a dominant effect, represented by d (figure 1.9). Therefore, the means of three
genotypes at the locus can be expressed by equation 1.9.


$$\mu_{AA} = m + a, \quad \mu_{Aa} = m + d, \quad \mu_{aa} = m - a \tag{1.9}$$

[Figure 1.9: number line of genotypic values, ordered P₂: aa (m − a), mid-parental value (m),
F₁: Aa (m + d), and P₁: AA (m + a). The additive effect a is the distance from the mid-parent
to either parent; the dominant effect d is the distance from the mid-parent to F₁.]


FIG. 1.9 – The one-locus additive and dominant genetic model in quantitative genetics.


In figure 1.9, the mean of genotype AA does not have to be higher than genotype
aa; the mean of genotype Aa does not have to be higher than mid-parent either.
Both additive and dominant effects can be positive, negative, or even 0. When one
locus has been identified to affect one targeted trait in breeding, the source of
the favorable allele in parents and whether the heterozygote outperforms both homozygotes
need to be further clarified in order to apply the genetic mapping result in
breeding. In equation 1.9, a positive additive effect indicates that parent P₁ (AA)
carries the allele increasing the mean performance of the trait. If higher performance
is desired in breeding, parent P₁ (AA) carries the favorable allele, or allele A is
favorable. On the contrary, when additive effect a is negative, P₁ (AA) carries the
allele decreasing the mean performance; P₂ (aa) carries the allele increasing mean
performance. Parent P₂ (aa) carries the favorable allele, or allele a is favorable.

The ratio of dominant to additive effects is called the degree of dominance.
Degree of dominance d/a > 1 indicates the presence of positive over-dominance,
where heterozygote Aa is located towards the direction of homozygote AA and
outperforms the mean performance of genotype AA. d/a = 1 represents positive
complete dominance, where heterozygote Aa has the same performance as genotype
AA. 0 < d/a < 1 represents positive partial dominance, where heterozygote Aa is
located towards the direction of homozygote AA, but its mean performance is lower
than genotype AA. Similarly, negative over-dominance, negative complete dominance,
and negative partial dominance can be defined accordingly.

Assume that plant height shown in table 1.6 is controlled by two alleles at one locus.
The genotype of the tall parent is AA with a mean plant height of 160.40 cm; the
genotype of the short parent is aa with a mean plant height of 103.00 cm; the
genotype of the F₁ hybrid is Aa with a mean plant height of 148.80 cm. From
equation 1.9, it can be found that mid-parent m = 131.70, additive effect a = 28.70,
and dominant effect d = 17.10. Degree of dominance d/a = 0.5958 at the locus,
indicating that the tall-parent allele A is positive partial dominant; the plant height
of heterozygote Aa is located in the direction of tall-height genotype AA but is lower
than the height of homozygous genotype AA. Due to the positive additive effect, the
allele in the tall parent P₁ (AA) increases plant height; the allele in the short parent
P₂ (aa) decreases plant height; the allele to reduce plant height exists in the short
parent with a mean plant height of 103.00 cm.

Of course, if wanted, the short parent can also be assumed to be AA, and the tall
parent to be aa. From equation 1.9, it can be found that m = 131.70, a = −28.70,
and d = 17.10. In this situation, d/a = −0.5958, indicating that the short-parent
allele A is negative partial dominant; the plant height of heterozygote Aa is located
in the direction of tall-height genotype aa. Due to the negative additive effect, the
allele in the short parent P₁ (AA) reduces plant height, and the allele to reduce plant
height still exists in the short parent.
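The two parameterizations above can be verified numerically from equation 1.9. This is a minimal sketch; the function name `one_locus_effects` is hypothetical, and the one-locus assumption is the same simplification made in the text.

```python
def one_locus_effects(mu_AA, mu_Aa, mu_aa):
    """Solve equation 1.9 for mid-parent m, additive effect a, dominant effect d."""
    m = (mu_AA + mu_aa) / 2
    a = (mu_AA - mu_aa) / 2
    d = mu_Aa - m
    return m, a, d

# Tall parent coded as AA (P1), short parent as aa (P2)
m1, a1, d1 = one_locus_effects(160.40, 148.80, 103.00)
# Short parent coded as AA (P1), tall parent as aa (P2)
m2, a2, d2 = one_locus_effects(103.00, 148.80, 160.40)

print(f"tall = AA : m = {m1:.2f}, a = {a1:.2f}, d = {d1:.2f}, d/a = {d1 / a1:.4f}")
print(f"short = AA: m = {m2:.2f}, a = {a2:.2f}, d = {d2:.2f}, d/a = {d2 / a2:.4f}")
```

Swapping the parental labels changes only the sign of a (and therefore of d/a); m and d are unchanged.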


1.3.3 Population Mean and Genetic Variance at One 
Locus 


Let $f_{AA}$, $f_{Aa}$, and $f_{aa}$ be frequencies of the three genotypes AA, Aa, and aa in
one population at one locus. Under the additive and dominance model, the population
mean and genetic variance can be calculated by equations 1.10 and 1.11, respectively
(see also exercise 1.10).
$$
\begin{aligned}
\mu &= f_{AA}(m + a) + f_{Aa}(m + d) + f_{aa}(m - a) \\
&= m + (f_{AA} - f_{aa})a + f_{Aa}d
\end{aligned}
\tag{1.10}
$$

$$
\begin{aligned}
\sigma_G^2 &= f_{AA}(m + a)^2 + f_{Aa}(m + d)^2 + f_{aa}(m - a)^2 - \mu^2 \\
&= \left[f_{AA} + f_{aa} - (f_{AA} - f_{aa})^2\right]a^2 - 2f_{Aa}(f_{AA} - f_{aa})ad + (f_{Aa} - f_{Aa}^2)d^2
\end{aligned}
\tag{1.11}
$$

In permanent populations without heterozygotes, $f_{Aa} = 0$, and population mean
and genetic variance can be acquired by equations 1.12 and 1.13, respectively,
independent of dominant effect d.


$$\mu = m + (f_{AA} - f_{aa})a \tag{1.12}$$


$$\sigma_G^2 = \left[1 - (f_{AA} - f_{aa})^2\right]a^2 \tag{1.13}$$


One major objective of genetic studies is to locate specific genes or chromosomal
regions which are responsible for phenotypic differences or variations in the trait of
interest. The larger the genetic effects, the more phenotypic variation is
explained by the locus, and the easier it is to detect the genes located at the locus.
It can be seen from equations 1.11 and 1.13 that genetic variance depends
not only on genetic effects but also on genotypic frequencies in the population. When
the dominant effect is not considered, i.e., d = 0 or permanent populations, it can be
seen from equations 1.11 and 1.13 that the maximum genetic variance is achieved
when $f_{AA} = f_{aa} = 0.5$. Genes in such populations, e.g., F₁DH and F₁RIL in
figure 1.1, are easier to identify. For a = 1 and d = 0, figure 1.10 shows
genetic variances in 20 bi-parental populations as indicated in figure 1.1. It can be
seen that genetic variance is reduced after one or two generations of backcrossing.
Therefore, under the assumption of homogeneous error variance in phenotyping, the
same genes would explain less variation in backcrossing populations, which would
negatively affect genetic analysis. After three or more generations of backcrossing,
the frequency of one genotype becomes closer to 1, and the other one closer
to 0. Genetic variance is further reduced in advanced backcrossing populations,
making them less suitable for conventional genetic studies by linkage analysis. For
this reason, when the target of the genetic study is to identify additive genes with
breeding applications for selecting elite pure lines as cultivars, bi-parental DH and
RIL populations derived from the F₁ hybrid are most suitable, followed by the F₃
population, and DH and RIL populations derived from one generation of
backcrossing.
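The ranking of populations described above can be reproduced from the genotypic frequencies in table 1.5. The sketch below computes genetic variance as the second moment minus the squared mean of genotypic values measured from the mid-parent m, which is algebraically the same as equation 1.11; the function name and the selection of populations are illustrative choices.

```python
def genetic_variance(f_AA, f_Aa, f_aa, a, d):
    """Genetic variance at one locus (equation 1.11), values measured from m."""
    mean_dev = (f_AA - f_aa) * a + f_Aa * d          # equation 1.10 minus m
    second_moment = (f_AA + f_aa) * a ** 2 + f_Aa * d ** 2
    return second_moment - mean_dev ** 2

# Genotypic frequencies (AA, Aa, aa) taken from table 1.5
table_1_5 = {
    "F1DH":     (0.5, 0.0, 0.5),
    "F2":       (0.25, 0.5, 0.25),
    "F3":       (0.375, 0.25, 0.375),
    "P1BC1F1":  (0.5, 0.5, 0.0),
    "P1BC1RIL": (0.75, 0.0, 0.25),
    "P1BC2F1":  (0.75, 0.25, 0.0),
    "P1BC2RIL": (0.875, 0.0, 0.125),
}

# Additive model used for figure 1.10: a = 1 and d = 0
variances = {name: genetic_variance(*f, a=1.0, d=0.0) for name, f in table_1_5.items()}
for name, v in sorted(variances.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {v:.4f}")
```

With a = 1 and d = 0, F₁DH reaches the maximum of 1, F₃ and P₁BC₁RIL both give 0.75, and the values drop further with a second backcross.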

If the dominant effect cannot be ignored, for example, in genetic studies with
applications for selecting elite hybrids as cultivars, heterozygotes have to be present
in genetic populations in order to estimate the dominant effect. It can be seen from
equation 1.11 that genetic variance depends on additive effect a, dominant effect d,
and frequencies of the three genotypes AA, Aa, and aa. For example, figure 1.11
shows genetic variances for three degrees of dominance, i.e., 0.5, 1, and 1.5, in
temporary populations as indicated in figure 1.1. For the degree of dominance 0.5,
the F₃ population has the largest genetic variance. For degrees of dominance 1 and
1.5, the backcross population with parent P₂ has the largest genetic variance.
Obviously, when the dominant effect is present, populations with equal allele
frequencies may not always give the highest genetic variance.


FIG. 1.10 – Genetic variances at a = 1 and d = 0 in 20 bi-parental populations.


FIG. 1.11 – Genetic variances at three levels of dominance in various populations.

When dominant and additive effects are located in the same direction, i.e.,
degree of dominance d/a > 0, genetic variance is increased in the backcross population
with parent P₂ but is reduced in the backcross population with parent P₁
(figure 1.11). Similarly, when dominant and additive effects are located in two
opposite directions, i.e., degree of dominance d/a < 0, genetic variance is increased
in the backcross population with parent P₁ but is reduced in the backcross population
with parent P₂. Considering that the trait of interest may be controlled by
genes at multiple loci, positive and negative degrees of dominance may both appear,
and genetic variance in the backcross population to either parent may not exceed the
genetic variance in the F₂ or F₃ population.
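The effect of dominance on the ranking of populations can be illustrated with equation 1.11 and the temporary populations of table 1.5. This is a sketch with a fixed at 1; the function and variable names are hypothetical, and the actual set of populations plotted in figure 1.11 is assumed to be the ten temporary populations listed below.

```python
def genetic_variance(f_AA, f_Aa, f_aa, a, d):
    """Genetic variance at one locus (equation 1.11), values measured from m."""
    mean_dev = (f_AA - f_aa) * a + f_Aa * d
    second_moment = (f_AA + f_aa) * a ** 2 + f_Aa * d ** 2
    return second_moment - mean_dev ** 2

# Temporary populations (with heterozygotes) and their frequencies from table 1.5
temporary = {
    "P1BC1F1": (0.5, 0.5, 0.0),       "P2BC1F1": (0.0, 0.5, 0.5),
    "F2":      (0.25, 0.5, 0.25),     "F3":      (0.375, 0.25, 0.375),
    "P1BC2F1": (0.75, 0.25, 0.0),     "P2BC2F1": (0.0, 0.25, 0.75),
    "P1BC1F2": (0.625, 0.25, 0.125),  "P2BC1F2": (0.125, 0.25, 0.625),
    "P1BC2F2": (0.8125, 0.125, 0.0625), "P2BC2F2": (0.0625, 0.125, 0.8125),
}

largest = {}
for degree in (0.5, 1.0, 1.5):        # degree of dominance d/a with a = 1
    values = {name: genetic_variance(*f, a=1.0, d=degree) for name, f in temporary.items()}
    best = max(values, key=values.get)
    largest[degree] = (best, values[best])
    print(f"d/a = {degree}: largest variance in {best} ({values[best]:.4f})")
```

In the backcross to P₁ the variance equals 0.25(a − d)², while in the backcross to P₂ it equals 0.25(a + d)², which is why the sign of d/a decides which backcross gains variance.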

In summary, when a number of genes are involved in the inheritance of the trait
of interest, which is likely the case for most quantitative traits, there is usually a lack
of knowledge or information on gene actions. It is therefore advisable to first
consider using populations with equal allele frequencies in genetic studies. These
populations, once developed, can also be used for many other traits that differ
between the two parents.
1.4 ANOVA on Single Environment Trials


Phenotypes of biological individuals come from the combined actions of genotypes
and environments. To publish research articles on genetic studies, many scientific
journals require authors to conduct multi-environmental phenotyping trials,
analysis of variance (ANOVA) on the phenotyping trials, and estimation of variance
components and heritability of the studied traits. Environments mentioned
here could be a number of locations or sites in one cropping season, a number of 
seasons or years in one given location, or the combinations of multi-locations and 
multi-years. Linear models and parameter estimation methods in relation to 
ANOVA will be introduced for single environmental and multi-environmental trials 
in §1.4 and §1.5, respectively. 


1.4.1 Linear Decomposition on Phenotypic Observation 


Assume one genetic population consists of g genotypes, which have been
phenotyped in one specific environment. Each genotype has r replicated
observations. Let $\mu_i$ be the mean performance of the ith genotype, which is to be
estimated. Under the assumption that the error effect follows the normal distribution
with mean of 0 and variance of $\sigma_\varepsilon^2$ (to be estimated as well), the kth
observation of the ith genotype, i.e., $y_{ik}$, follows the normal distribution with mean
of $\mu_i$ and variance of $\sigma_\varepsilon^2$ (equation 1.14).


$$y_{ik} \sim N(\mu_i, \sigma_\varepsilon^2) \tag{1.14}$$


Let $\mu_{.}$ be the overall mean of g genotypes, i.e., $\mu_{.} = \frac{1}{g}\sum_i \mu_i$,
and $G_i$ be the deviation of the ith genotypic mean to the overall mean, i.e.,
$G_i = \mu_i - \mu_{.}$. The observation given in equation 1.14 can be rewritten in a linear
model, i.e., equation 1.15.


$$y_{ik} = \mu_{.} + G_i + \varepsilon_{ik}, \tag{1.15}$$


where $\varepsilon_{ik} \sim N(0, \sigma_\varepsilon^2)$ (i = 1, 2, ..., g; k = 1, 2, ..., r) iid.
In equation 1.15, error effects are required to be independently and identically
distributed, which can be satisfied by suitable field experimental design
and proper sampling techniques, such as the randomized complete block design and
random sampling. Theoretical genetic variance, i.e., $\sigma_G^2$, can be defined from the
g genotypic means in equation 1.16. Given next is the estimation of genetic variance
$\sigma_G^2$ and error variance $\sigma_\varepsilon^2$ from observations $y_{ik}$, based on
which the broad-sense heritability can be calculated, i.e., the proportion of genetic
variance $\sigma_G^2$ to phenotypic variance $\sigma_P^2$. In single environmental trials,
phenotypic variance is the sum of genetic and error variances, i.e.,
$\sigma_P^2 = \sigma_G^2 + \sigma_\varepsilon^2$.


$$\sigma_G^2 = \frac{1}{g - 1}\sum_{i=1}^{g} G_i^2 \tag{1.16}$$
1.4.2 Decomposition of Sum of Squares of Phenotypic
Deviations


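Under the assumption given in equation 1.15, the overall sample mean, i.e., $\bar{y}_{..}$, is
the simple average across the gr observations, which follows a normal distribution with a
mean of $\mu_{.}$ and variance of $\frac{1}{gr}\sigma_\varepsilon^2$, i.e.,
$\bar{y}_{..} = \frac{1}{gr}\sum_{i,k} y_{ik}$ and
$\bar{y}_{..} \sim N(\mu_{.}, \frac{1}{gr}\sigma_\varepsilon^2)$. Sample mean of
the ith genotype, i.e., $\bar{y}_{i.}$, is the simple average across r replications, which
follows a normal distribution with mean of $\mu_i$ and variance of
$\frac{1}{r}\sigma_\varepsilon^2$, i.e., $\bar{y}_{i.} = \frac{1}{r}\sum_k y_{ik}$ and
$\bar{y}_{i.} \sim N(\mu_i, \frac{1}{r}\sigma_\varepsilon^2)$. Therefore, the deviation of each
observation to the overall sample mean, which can be acquired from equation 1.15, can be
further decomposed into two items represented in equation 1.17.


$$y_{ik} - \bar{y}_{..} = (\bar{y}_{i.} - \bar{y}_{..}) + (y_{ik} - \bar{y}_{i.}) \tag{1.17}$$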


On the right side of equation 1.17, the first term is the deviation of the sample
mean of each genotype to the overall sample mean, which can be used to estimate
the genetic effect $G_i$ and genetic variance $\sigma_G^2$; the second term is the deviation
of each observation to the sample mean of the genotype, which can be used to estimate
error effect $\varepsilon_{ik}$ and error variance $\sigma_\varepsilon^2$. Let $SS_T$ be the
total sum of squares of all observations to the overall sample mean, which can be decomposed
as follows into two items.


$$SS_T = \sum_{i,k} (y_{ik} - \bar{y}_{..})^2 = \sum_{i,k} \left[(\bar{y}_{i.} - \bar{y}_{..}) + (y_{ik} - \bar{y}_{i.})\right]^2$$


Let $SS_\varepsilon$ be the sum of squares of error effects, i.e.,
$SS_\varepsilon = \sum_{i,k} (y_{ik} - \bar{y}_{i.})^2$, which contains the information on error
variance $\sigma_\varepsilon^2$. Let $SS_G$ be the sum of squares of genetic effects, i.e.,
$SS_G = r\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2$, which contains the information on genetic
variance $\sigma_G^2$. Since the cross products of the two terms sum to 0 within each
genotype, the total sum of squares can be decomposed as the summation of
two components $SS_G$ and $SS_\varepsilon$, i.e.,


$$SS_T = SS_G + SS_\varepsilon$$


The sum of squares of error effects, i.e., $SS_\varepsilon$, and the sum of squares of genetic
effects, i.e., $SS_G$, can be immediately acquired once the replicated phenotypic data is
collected. The relationship between the sum of squares and variance components is
needed so as to estimate the two unknown variances, i.e., $\sigma_\varepsilon^2$ and
$\sigma_G^2$. From the normal distributions which are followed by observations, sample means
and overall sample mean, the following relationship can be acquired between the expectation
of $SS_\varepsilon$ and the unknown error variance $\sigma_\varepsilon^2$.

$$E(SS_\varepsilon) = \sum_i E\sum_k (y_{ik} - \bar{y}_{i.})^2 = \sum_i (r - 1)\sigma_\varepsilon^2 = g(r - 1)\sigma_\varepsilon^2$$

where coefficient g(r − 1) ahead of $\sigma_\varepsilon^2$ is called the degree of freedom of
errors. Let the mean square of error effects, i.e., $MS_\varepsilon$, be equal to the sum of
squares divided by the degree of freedom. The expectation of the mean square of errors is
equal to error variance $\sigma_\varepsilon^2$, as given in equation 1.18. In statistical
words, the mean square of error effects is an unbiased estimate of the error variance.


$$MS_\varepsilon = \frac{SS_\varepsilon}{g(r - 1)}, \quad E(MS_\varepsilon) = \sigma_\varepsilon^2 \tag{1.18}$$


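By a similar principle, the following relationship can be acquired between the
expectation of $SS_G$ and the two unknown variances $\sigma_G^2$ and $\sigma_\varepsilon^2$.


$$
\begin{aligned}
E(SS_G) &= rE\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2 = rE\sum_i \left[(\bar{y}_{i.} - \mu_i) - (\bar{y}_{..} - \mu_{.}) + (\mu_i - \mu_{.})\right]^2 \\
&= rE\sum_i \left[(\bar{y}_{i.} - \mu_i) - (\bar{y}_{..} - \mu_{.})\right]^2 + r\sum_i G_i^2 \\
&= r\left(g \cdot \frac{\sigma_\varepsilon^2}{r} - g \cdot \frac{\sigma_\varepsilon^2}{gr}\right) + (g - 1)r\sigma_G^2 \\
&= (g - 1)\sigma_\varepsilon^2 + (g - 1)r\sigma_G^2
\end{aligned}
$$


where coefficient (g − 1) is called the degree of freedom of genotypes or genotypic
effects. Let the mean square of genotypic effects, i.e., $MS_G$, be equal to the sum of
their squares divided by their degree of freedom. The expectation of the mean square
of genotypic effects has a linear relationship with error variance $\sigma_\varepsilon^2$ and
genetic variance $\sigma_G^2$, as given in equation 1.19.


$$MS_G = \frac{SS_G}{g - 1}, \quad E(MS_G) = \sigma_\varepsilon^2 + r\sigma_G^2 \tag{1.19}$$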


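From equations 1.18 and 1.19, unbiased estimates can be acquired for the previously
defined variances $\sigma_\varepsilon^2$ and $\sigma_G^2$, which are given by equations 1.20
and 1.21, respectively.


$$\hat{\sigma}_\varepsilon^2 = MS_\varepsilon, \tag{1.20}$$


$$\hat{\sigma}_G^2 = \frac{1}{r}(MS_G - MS_\varepsilon) \tag{1.21}$$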
Table 1.7 summarizes the previous analysis of observations from single environmental
trials. Variation in observations comes from two sources. One is the
genotypic difference among individuals in the genetic population, and the other one
is a random error associated with each replicated observation. The degree of freedom
for genotypic variation is the number of genotypes minus one, i.e., (g − 1); the
degree of freedom for random errors is g(r − 1); and the total degree of freedom is the
number of observations minus one. It can be easily seen that the total degree of
freedom is also equal to the sum of the two degrees of freedom of genotype and random
error. In statistics, it can be proved that when $\sigma_G^2 = 0$, the ratio of the two mean
squares $MS_G$ and $MS_\varepsilon$ follows the F distribution with two degrees of freedom,
i.e., (g − 1) and g(r − 1), as given in equation 1.22.


$$F = \frac{MS_G}{MS_\varepsilon} \sim F[g - 1,\ g(r - 1)] \tag{1.22}$$


Therefore, the significance of genotypic variation in the studied population can
be tested by an F statistic. If the observed F statistic does not exceed a threshold
value at a given significance level, e.g., 0.05 or 0.01, genetic variation in the
population is declared to be non-significant, and the population may not be suitable for
conducting the genetic study on the investigated trait.


TAB. 1.7 — ANOVA on single environmental phenotyping trials. 


Source of       Degree of freedom   Sum of squares   Mean square   Expectation
variation       (DF)                (SS)             (MS)          of MS
Genotype        g − 1               SS_G             MS_G          σ²_ε + rσ²_G
Random error    g(r − 1)            SS_ε             MS_ε          σ²_ε
Total           gr − 1              SS_T


In complete block experimental design, the difference between blocks could be 
significant as well. When this design is used, the block should also be considered as a 
source of variation in ANOVA, and its degree of freedom is equal to the number of 
blocks minus one. Considering blocks in ANOVA does not affect the degree of freedom
and sum of squares of genotypes, but does affect the degree of freedom and sum of
squares of errors. When the block is included in the linear model of ANOVA as a
source of variation, the degree of freedom of errors will be reduced by the degree of 
freedom of blocks, and the sum of squares of error effects will be reduced by the sum 
of squares of block effects. Block variance can also be defined and estimated, but 
may not be of concern in most genetic studies. Once the error degree of freedom and 
its sum of squares have been adjusted by blocks, error variance and genetic variance 
can be calculated in the same way as shown by equations 1.20 and 1.21. 
1.4.3 Single Environmental ANOVA on Rice Grain
Length


Table 1.8 gives the grain length of 22 genotypes, i.e., two rice cultivars (i.e., Asominori
and IR24) and 20 recombinant inbred lines (i.e., RIL1–RIL20). These genotypes
were grown in four environments represented by four locations in China, and each
environment has two replications (Wan et al., 2005; Wan et al., 2004). Parent IR24
consistently has a longer grain length than parent Asominori across the four
environments. The grain length of most RILs is between the two parents, but some
lines, e.g., RIL11 and RIL16, have shorter grain lengths than the shorter parent Asominori.


TAB. 1.8 – Two replicated (i.e., R1 and R2) grain lengths (mm) of two rice cultivars (i.e.,
Asominori and IR24) and their 20 RILs (i.e., RIL1–RIL20) in four locations (i.e., E1–E4).


Genotype E1 (Nanjing, Jiangsu) E2 (Jinhu, Jiangsu) E3 (Donghai, Jiangsu) E4 (Hainan)
R1 R2 R1 R2 R1 R2 R1 R2

Asominori 5.26 5.17 5.17 5.21 5.16 5.22 5.33 5.21 
IR24 5.96 6.18 6.17 6.08 6.20 6.07 6.09 6.12 
RIL1 5.36 5.49 5.54 5.40 5.44 5.45 5.62 5.39
RIL2 6.30 6.37 6.24 6.15 6.19 6.32 6.49 6.33 
RIL3 5.29 5.36 5.36 5.27 5.27 5.17 5.33 5.39 
RIL4 5.44 5.42 5.37 5.45 5.47 5.41 5.35 5.34 
RIL5 5.34 5.41 5.40 5.32 5.41 5.32 5.33 5.28 
RIL6 5.38 5.46 5.53 5.34 5.40 5.41 5.40 5.33 
RIL7 5.45 5.45 5.51 5.37 5.30 5.53 5.46 5.42 
RIL8 5.65 5.65 5.65 5.65 5.65 5.65 5.76 5.63 
RIL9 5.11 5.21 5.19 5.17 5.24 5.15 5.11 5.11 
RIL10 5.13 5.26 5.21 5.22 5.30 5.22 5.44 5.34 
RIL11 4.98 4.80 4.72 4.88 4.78 4.85 5.48 5.29
RIL12 5.68 5.69 5.75 5.66 5.63 5.72 5.71 5.67 
RIL13 5.00 4.91 4.88 5.00 4.94 5.13 4.88 4.89 
RIL14 5.83 5.88 5.87 5.86 5.89 5.85 6.00 5.94 
RIL15 5.41 5.56 5.53 5.50 5.59 5.47 5.50 5.60 
RIL16 4.90 4.75 4.86 4.77 4.64 4.88 4.92 4.89 
RIL17 5.62 5.62 5.69 5.58 5.54 5.66 5.53 5.42 
RIL18 5.95 6.05 6.09 5.93 6.01 5.97 5.88 5.84 
RIL19 4.99 4.94 4.95 5.06 5.13 4.87 5.09 5.02 
RIL20 6.10 5.91 6.12 5.90 6.00 6.11 6.05 6.07 


The purpose of conducting ANOVA is to estimate the genetic variance in the 
RIL population. Parents are mainly used as checks in genetic studies, which are not 
included in the analysis. For each environment, the number of genotypes g = 20; the 
number of replicates r = 2. Genotypic and error effects are included in the linear 
model of ANOVA; replication or block effect is not considered. The degree of free- 
dom is 19 for genotypes, and 20 for errors, with a total degree of freedom of 39, one 
TAB. 1.9 – Single environmental analysis of variance on grain length of 20 RILs grown in four
environments.


Environment              Source         DF   SS       MS       F-value   Significance   Variance
E1 (Nanjing, Jiangsu)    Genotype       19   6.0795   0.3200   63.3617   <0.001         0.1575
                         Random error   20   0.1010   0.0050                            0.0050
                         Total          39   6.1805
E2 (Jinhu, Jiangsu)      Genotype       19   5.8770   0.3093   47.5505   <0.001         0.1514
                         Random error   20   0.1301   0.0065                            0.0065
                         Total          39   6.0071
E3 (Donghai, Jiangsu)    Genotype       19   6.0405   0.3179   39.1767   <0.001         0.1549
                         Random error   20   0.1623   0.0081                            0.0081
                         Total          39   6.2028
E4 (Hainan)              Genotype       19   5.5555   0.2924   61.7522   <0.001         0.1439
                         Random error   20   0.0947   0.0047                            0.0047
                         Total          39   5.6502

smaller than the total number of observations (table 1.9). Results indicate that the
significance probability is lower than 0.001, and therefore the genetic difference in grain
length in the RIL population is extremely significant in each environment. Genetic
variance is estimated to be much higher than random error variance (table 1.9). The
significant results also indicate that random errors were well controlled in field
experiments, that phenotyping accuracy was satisfactory, and that the population can be
further used for genetic analysis and gene mapping.
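The E1 rows of table 1.9 can be reproduced from the E1 columns of table 1.8 with equations 1.18 to 1.22. The following sketch is a plain-Python illustration without block effects, matching the model described above; the function name `one_way_anova` and the dictionary keys are hypothetical, and the significance probability of the F statistic is not computed here.

```python
def one_way_anova(data):
    """data: one list of r replicated observations per genotype."""
    g, r = len(data), len(data[0])
    grand_mean = sum(sum(row) for row in data) / (g * r)
    means = [sum(row) / r for row in data]
    ss_g = r * sum((m - grand_mean) ** 2 for m in means)
    ss_e = sum((y - m) ** 2 for row, m in zip(data, means) for y in row)
    ms_g = ss_g / (g - 1)              # equation 1.19
    ms_e = ss_e / (g * (r - 1))        # equation 1.18
    return {
        "SS_G": ss_g, "SS_e": ss_e, "MS_G": ms_g, "MS_e": ms_e,
        "F": ms_g / ms_e,              # equation 1.22
        "var_e": ms_e,                 # equation 1.20
        "var_G": (ms_g - ms_e) / r,    # equation 1.21
    }

# Grain length (mm) of RIL1 to RIL20 in E1 (R1, R2), from table 1.8
e1 = [
    [5.36, 5.49], [6.30, 6.37], [5.29, 5.36], [5.44, 5.42], [5.34, 5.41],
    [5.38, 5.46], [5.45, 5.45], [5.65, 5.65], [5.11, 5.21], [5.13, 5.26],
    [4.98, 4.80], [5.68, 5.69], [5.00, 4.91], [5.83, 5.88], [5.41, 5.56],
    [4.90, 4.75], [5.62, 5.62], [5.95, 6.05], [4.99, 4.94], [6.10, 5.91],
]
result = one_way_anova(e1)
for key, value in result.items():
    print(f"{key}: {value:.4f}")
```

The two parents are left out, as in the text, so g = 20 and r = 2.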


1.5 ANOVA on Multi-Environment Trials 


1.5.1 Linear Decomposition on Phenotypic Observation 


Assume g genotypes are phenotyped in e environments, and
each environment has r replications. Let $\mu_{ij}$ be the mean performance of the
ith genotype in the jth environment, which is to be estimated. Under the assumption
that the error effect follows the normal distribution with a mean of 0 and variance of
$\sigma_\varepsilon^2$ (to be estimated as well), the kth observation of the ith genotype in
the jth environment, i.e., $y_{ijk}$, follows the normal distribution with a mean of
$\mu_{ij}$ and variance of $\sigma_\varepsilon^2$ (equation 1.23).


$$y_{ijk} \sim N(\mu_{ij}, \sigma_\varepsilon^2) \tag{1.23}$$


Let $\mu_{..}$ be the overall mean across the g genotypes and e environments, $\mu_{i.}$ be
the mean of the ith genotype across the e environments (or genotypic mean), and $\mu_{.j}$ be
the mean of the jth environment across the g genotypes (or environmental mean), that is,


$$\mu_{..} = \frac{1}{ge}\sum_{i,j} \mu_{ij}, \quad \mu_{i.} = \frac{1}{e}\sum_{j} \mu_{ij}, \quad \mu_{.j} = \frac{1}{g}\sum_{i} \mu_{ij}$$
Deviation of genotypic mean $\mu_{i.}$ to overall mean $\mu_{..}$ is called genotypic effect,
denoted by $G_i$; deviation of environmental mean $\mu_{.j}$ to overall mean $\mu_{..}$ is
called environmental effect, denoted by $E_j$, that is,


$$G_i = \mu_{i.} - \mu_{..}, \quad E_j = \mu_{.j} - \mu_{..}$$


The interaction effect between the ith genotype and jth environment, i.e., $GE_{ij}$,
can be defined as follows.


$$GE_{ij} = \mu_{ij} - \mu_{i.} - \mu_{.j} + \mu_{..}$$

It can be easily seen that,

$$\mu_{ij} = \mu_{..} + G_i + E_j + GE_{ij}$$


Therefore, the observation can be rewritten as a linear model given in 
equation 1.24. 


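$$y_{ijk} = \mu_{ij} + \varepsilon_{ijk} = \mu_{..} + G_i + E_j + GE_{ij} + \varepsilon_{ijk} \tag{1.24}$$


where $\varepsilon_{ijk} \sim N(0, \sigma_\varepsilon^2)$ (i = 1, 2, ..., g; j = 1, 2, ..., e; k = 1, 2, ..., r) iid.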

Similar to single environmental trials, the error effects defined by equation 1.24 are
required to be independent and identically distributed, which is fundamental in
conducting ANOVA. Theoretical genetic variance ($\sigma_G^2$), environmental variance
($\sigma_E^2$), and genotype by environment (GE) interaction variance ($\sigma_{GE}^2$)
can be defined by equations 1.25, 1.26, and 1.27, respectively.


$$\sigma_G^2 = \frac{1}{g-1}\sum_{i} G_i^2 \qquad (1.25)$$

$$\sigma_E^2 = \frac{1}{e-1}\sum_{j} E_j^2 \qquad (1.26)$$

$$\sigma_{GE}^2 = \frac{1}{(g-1)(e-1)}\sum_{i,j} GE_{ij}^2 \qquad (1.27)$$


Estimation of the variance components $\sigma_G^2$, $\sigma_E^2$, $\sigma_{GE}^2$, and
$\sigma_\varepsilon^2$ from observations $y_{ijk}$ will be introduced in the following
section, based on which the broad-sense heritability can be calculated.
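The decomposition of cell means into an overall mean, genotypic effects, environmental effects, and GE interaction effects can be illustrated numerically. The following is a minimal sketch in plain Python, assuming the cell means are known; the function names and the 3 x 2 table of cell means are hypothetical and chosen only for illustration.

```python
def decompose_cell_means(mu):
    """Split a g x e table of cell means mu[i][j] into the overall mean and effects."""
    g, e = len(mu), len(mu[0])
    overall = sum(sum(row) for row in mu) / (g * e)
    geno_mean = [sum(row) / e for row in mu]                            # genotypic means
    env_mean = [sum(mu[i][j] for i in range(g)) / g for j in range(e)]  # environmental means
    G = [m - overall for m in geno_mean]                                # genotypic effects
    E = [m - overall for m in env_mean]                                 # environmental effects
    GE = [[mu[i][j] - geno_mean[i] - env_mean[j] + overall for j in range(e)]
          for i in range(g)]                                            # interaction effects
    return overall, G, E, GE


def effect_variances(G, E, GE):
    """Theoretical variances of the effects, with divisors g-1, e-1 and (g-1)(e-1)."""
    g, e = len(G), len(E)
    var_G = sum(x * x for x in G) / (g - 1)
    var_E = sum(x * x for x in E) / (e - 1)
    var_GE = sum(x * x for row in GE for x in row) / ((g - 1) * (e - 1))
    return var_G, var_E, var_GE


# Hypothetical cell means: 3 genotypes (rows) in 2 environments (columns).
mu = [[5.0, 5.4], [6.0, 6.2], [5.5, 5.3]]
overall, G, E, GE = decompose_cell_means(mu)
var_G, var_E, var_GE = effect_variances(G, E, GE)
print(overall, G, E, GE)
print(var_G, var_E, var_GE)
```

By construction, the genotypic effects sum to zero, the environmental effects sum to zero, and the interaction effects sum to zero within each genotype and within each environment.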


1.5.2 Decomposition of Sum of Squares of Phenotypic 
Deviations 


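The overall sample mean, i.e., $\bar{y}_{...}$, is the simple average across all observations,


$$\bar{y}_{...} = \frac{1}{ger}\sum_{i,j,k} y_{ijk},$$


which follows a normal distribution with a mean of $\bar{\mu}_{..}$ and a variance of
$\frac{1}{ger}\sigma_\varepsilon^2$.

The sample mean of the ith genotype in the jth environment is the simple average
across the r replications, which follows a normal distribution with a mean of $\mu_{ij}$
and a variance of $\frac{1}{r}\sigma_\varepsilon^2$, i.e.,


$$\bar{y}_{ij.} = \frac{1}{r}\sum_{k} y_{ijk}, \quad \bar{y}_{ij.} \sim N\left(\mu_{ij}, \frac{1}{r}\sigma_\varepsilon^2\right)$$


The sample mean of the ith genotype is the simple average across the e environments
and r replications, which follows a normal distribution with a mean of
$\bar{\mu}_{i.} = \bar{\mu}_{..} + G_i$ and a variance of $\frac{1}{er}\sigma_\varepsilon^2$, i.e.,


$$\bar{y}_{i..} = \frac{1}{er}\sum_{j,k} y_{ijk}, \quad \bar{y}_{i..} \sim N\left(\bar{\mu}_{..} + G_i, \frac{1}{er}\sigma_\varepsilon^2\right)$$


The sample mean of the jth environment is the simple average across the g genotypes
and r replications, which follows a normal distribution with a mean of
$\bar{\mu}_{.j} = \bar{\mu}_{..} + E_j$ and a variance of $\frac{1}{gr}\sigma_\varepsilon^2$, i.e.,


$$\bar{y}_{.j.} = \frac{1}{gr}\sum_{i,k} y_{ijk}, \quad \bar{y}_{.j.} \sim N\left(\bar{\mu}_{..} + E_j, \frac{1}{gr}\sigma_\varepsilon^2\right)$$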


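Based on the overall sample mean and the genotypic and environmental sample means,
the deviation of each observation from the overall sample mean, as obtained from
equation 1.24, can be equivalently rewritten as equation 1.28.


$$y_{ijk} - \bar{y}_{...} = (\bar{y}_{i..} - \bar{y}_{...}) + (\bar{y}_{.j.} - \bar{y}_{...}) + (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}) + (y_{ijk} - \bar{y}_{ij.}) \qquad (1.28)$$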


On the right side of equation 1.28, the first term is the deviation of the sample
mean of each genotype from the overall sample mean, which can be used to estimate
the genotypic effect $G_i$ and genotypic variance $\sigma_G^2$; the second term is the
deviation of the sample mean of each environment from the overall sample mean, which
can be used to estimate the environmental effect $E_j$ and environmental variance
$\sigma_E^2$; the third term is the deviation of the sample mean of each genotype in each
environment from the overall sample mean after taking away the genotypic and
environmental effects, which can be used to estimate the interaction effect $GE_{ij}$ and
interaction variance $\sigma_{GE}^2$; the fourth term is the deviation of each observation
from the sample mean of its genotype and environment, which can be used to estimate the
error effect $\varepsilon_{ijk}$ and error variance $\sigma_\varepsilon^2$. Let $SS_T$ be
the total sum of squares of deviations of all observations from the overall sample mean.
Because all cross-product terms sum to zero, $SS_T$ can be decomposed as follows.


$$\begin{aligned}
SS_T &= \sum_{i,j,k}(y_{ijk} - \bar{y}_{...})^2 \\
&= \sum_{i,j,k}\left[(y_{ijk} - \bar{y}_{ij.}) + (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}) + (\bar{y}_{i..} - \bar{y}_{...}) + (\bar{y}_{.j.} - \bar{y}_{...})\right]^2 \\
&= \sum_{i,j,k}(y_{ijk} - \bar{y}_{ij.})^2 + r\sum_{i,j}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 \\
&\quad + er\sum_{i}(\bar{y}_{i..} - \bar{y}_{...})^2 + gr\sum_{j}(\bar{y}_{.j.} - \bar{y}_{...})^2
\end{aligned}$$


Let $SS_\varepsilon$ be the sum of squares of error effects, i.e.,
$SS_\varepsilon = \sum_{i,j,k}(y_{ijk} - \bar{y}_{ij.})^2$, which contains the information
on error variance $\sigma_\varepsilon^2$. Let $SS_{GE}$ be the sum of squares of GE
interaction effects, i.e.,
$SS_{GE} = r\sum_{i,j}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2$,
which contains the information on GE interaction variance $\sigma_{GE}^2$. Let $SS_G$ be
the sum of squares of genotypic effects, i.e.,
$SS_G = er\sum_{i}(\bar{y}_{i..} - \bar{y}_{...})^2$, which contains the information on
genotypic variance $\sigma_G^2$. Let $SS_E$ be the sum of squares of environmental effects,
i.e., $SS_E = gr\sum_{j}(\bar{y}_{.j.} - \bar{y}_{...})^2$, which contains the information
on environmental variance $\sigma_E^2$. Thus, the total sum of squares can be decomposed
as the summation of four components $SS_G$, $SS_E$, $SS_{GE}$, and $SS_\varepsilon$, i.e.,


$$SS_T = SS_G + SS_E + SS_{GE} + SS_\varepsilon$$
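The decomposition of the total sum of squares can be checked numerically on a small balanced dataset. The sketch below is written in plain Python under the assumption of a balanced design with no missing values; the function name and the data (3 genotypes, 2 environments, 2 replicates) are hypothetical.

```python
def anova_sums_of_squares(y):
    """y[i][j][k]: k-th replicate of genotype i in environment j (balanced design)."""
    g, e, r = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / r for j in range(e)] for i in range(g)]   # genotype-by-environment means
    geno = [sum(cell[i]) / e for i in range(g)]                       # genotypic sample means
    env = [sum(cell[i][j] for i in range(g)) / g for j in range(e)]   # environmental sample means
    grand = sum(geno) / g                                             # overall sample mean
    return {
        "total": sum((obs - grand) ** 2 for row in y for reps in row for obs in reps),
        "error": sum((y[i][j][k] - cell[i][j]) ** 2
                     for i in range(g) for j in range(e) for k in range(r)),
        "GE": r * sum((cell[i][j] - geno[i] - env[j] + grand) ** 2
                      for i in range(g) for j in range(e)),
        "G": e * r * sum((m - grand) ** 2 for m in geno),
        "E": g * r * sum((m - grand) ** 2 for m in env),
    }


# Hypothetical observations: 3 genotypes x 2 environments x 2 replicates.
y = [
    [[5.1, 5.3], [5.6, 5.4]],
    [[6.0, 6.2], [6.1, 6.5]],
    [[5.4, 5.2], [5.0, 5.3]],
]
ss = anova_sums_of_squares(y)
parts = ss["G"] + ss["E"] + ss["GE"] + ss["error"]
print(ss)
print("total:", ss["total"], "sum of four components:", parts)
```

The two printed totals agree up to floating-point rounding, which is the identity stated above. With unbalanced data the cross-product terms no longer vanish and this simple decomposition does not hold.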


The sum of squares from each source of variation can be calculated directly once the
multi-environment replicated phenotypic data are collected. A relationship between the
sums of squares and the unknown variances is needed so as to estimate each variance
component. From the normal distributions followed by the observations, the sample mean
of each genotype, and the sample mean of each environment, the following relationship
can be acquired between the expectation of $SS_\varepsilon$ and the unknown error
variance $\sigma_\varepsilon^2$.


$$\begin{aligned}
E(SS_\varepsilon) &= E\sum_{i,j,k}(y_{ijk} - \bar{y}_{ij.})^2 = E\sum_{i,j,k}\left[(y_{ijk} - \mu_{ij}) - (\bar{y}_{ij.} - \mu_{ij})\right]^2 \\
&= \sum_{i,j,k}E(y_{ijk} - \mu_{ij})^2 - \sum_{i,j,k}E(\bar{y}_{ij.} - \mu_{ij})^2 \\
&= ger\,\sigma_\varepsilon^2 - ger\,\frac{\sigma_\varepsilon^2}{r} = ge(r-1)\sigma_\varepsilon^2,
\end{aligned}$$


where the coefficient $ge(r-1)$ ahead of $\sigma_\varepsilon^2$ is called the degrees of
freedom of errors. Let the mean square of error effects, i.e., $MS_\varepsilon$, be equal
to the sum of squares divided by the degrees of freedom. The expectation of the mean
square of error effects is then equal to the error variance $\sigma_\varepsilon^2$, as
given in equation 1.29. Therefore, the mean square of error effects $MS_\varepsilon$ is
an unbiased estimate of the error variance $\sigma_\varepsilon^2$.


$$MS_\varepsilon = \frac{SS_\varepsilon}{ge(r-1)}, \quad E(MS_\varepsilon) = \sigma_\varepsilon^2 \qquad (1.29)$$


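The following relationship can be acquired between the expectation of $SS_{GE}$ and
two unknown variances $\sigma_{GE}^2$ and $\sigma_\varepsilon^2$.


$$\begin{aligned}
E(SS_{GE}) &= rE\sum_{i,j}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 \\
&= rE\sum_{i,j}\left[(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...} - GE_{ij}) + GE_{ij}\right]^2 \\
&= rE\sum_{i,j}(\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...} - GE_{ij})^2 + r\sum_{i,j}GE_{ij}^2 \quad \text{(multiplicative term = 0)}
\end{aligned}$$


where,


$$\text{first term} = (g-1)(e-1)\sigma_\varepsilon^2, \quad \text{second term} = (g-1)(e-1)r\sigma_{GE}^2$$


Therefore,


$$E(SS_{GE}) = (g-1)(e-1)\sigma_\varepsilon^2 + (g-1)(e-1)r\sigma_{GE}^2$$


where the coefficient $(g-1)(e-1)$ is called the degrees of freedom of GE interactions.
Let the mean square of GE interactions, i.e., $MS_{GE}$, be equal to the sum of
squares divided by the degrees of freedom, i.e., equation 1.30.


$$MS_{GE} = \frac{SS_{GE}}{(g-1)(e-1)}, \quad E(MS_{GE}) = \sigma_\varepsilon^2 + r\sigma_{GE}^2 \qquad (1.30)$$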


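The following relationship holds between the expectation of $SS_G$ and two
unknown variances $\sigma_G^2$ and $\sigma_\varepsilon^2$.


$$\begin{aligned}
E(SS_G) &= erE\sum_{i}(\bar{y}_{i..} - \bar{y}_{...})^2 \\
&= erE\sum_{i}\left[(\bar{y}_{i..} - \bar{\mu}_{i.}) - (\bar{y}_{...} - \bar{\mu}_{..}) + (\bar{\mu}_{i.} - \bar{\mu}_{..})\right]^2 \\
&= er\sum_{i}E(\bar{y}_{i..} - \bar{\mu}_{i.})^2 - ger\,E(\bar{y}_{...} - \bar{\mu}_{..})^2 + er\sum_{i}G_i^2 \\
&= er \times g \times \frac{\sigma_\varepsilon^2}{er} - ger \times \frac{\sigma_\varepsilon^2}{ger} + (g-1)er\,\sigma_G^2 \\
&= (g-1)\sigma_\varepsilon^2 + (g-1)er\,\sigma_G^2
\end{aligned}$$


where the coefficient $(g-1)$ is called the degrees of freedom of genotypic effects. Let
the mean square of genotypic effects, i.e., $MS_G$, be equal to the sum of squares
divided by the degrees of freedom, i.e., equation 1.31.


$$MS_G = \frac{SS_G}{g-1}, \quad E(MS_G) = \sigma_\varepsilon^2 + er\,\sigma_G^2 \qquad (1.31)$$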


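The following relationship holds between the expectation of $SS_E$ and two
unknown variances $\sigma_E^2$ and $\sigma_\varepsilon^2$.


$$\begin{aligned}
E(SS_E) &= grE\sum_{j}(\bar{y}_{.j.} - \bar{y}_{...})^2 \\
&= gr \times e \times \frac{\sigma_\varepsilon^2}{gr} - ger \times \frac{\sigma_\varepsilon^2}{ger} + g(e-1)r\,\sigma_E^2 \\
&= (e-1)\sigma_\varepsilon^2 + g(e-1)r\,\sigma_E^2
\end{aligned}$$


where the coefficient $(e-1)$ is called the degrees of freedom of environmental effects.
Let the mean square of environmental effects, i.e., $MS_E$, be equal to the sum of
squares divided by the degrees of freedom, i.e., equation 1.32.


$$MS_E = \frac{SS_E}{e-1}, \quad E(MS_E) = \sigma_\varepsilon^2 + gr\,\sigma_E^2 \qquad (1.32)$$


From equations 1.29-1.32, the unbiased estimates of variance components can be
acquired, as given in equations 1.33-1.36, respectively, for error variance
$\sigma_\varepsilon^2$, GE interaction variance $\sigma_{GE}^2$, genotypic variance
$\sigma_G^2$, and environmental variance $\sigma_E^2$.


$$\hat{\sigma}_\varepsilon^2 = MS_\varepsilon \qquad (1.33)$$

$$\hat{\sigma}_{GE}^2 = \frac{1}{r}(MS_{GE} - MS_\varepsilon) \qquad (1.34)$$

$$\hat{\sigma}_G^2 = \frac{1}{er}(MS_G - MS_\varepsilon) \qquad (1.35)$$

$$\hat{\sigma}_E^2 = \frac{1}{gr}(MS_E - MS_\varepsilon) \qquad (1.36)$$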
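The step from sums of squares to variance components can be sketched in a few lines. The following plain-Python function assumes the expected mean squares derived above, i.e., every mean square is compared with the error mean square; the function name and the sums of squares are hypothetical and serve only to show the arithmetic.

```python
def variance_components(ss_g, ss_e, ss_ge, ss_err, g, e, r):
    """Estimate variance components from sums of squares in a balanced g x e x r trial."""
    ms_g = ss_g / (g - 1)
    ms_e = ss_e / (e - 1)
    ms_ge = ss_ge / ((g - 1) * (e - 1))
    ms_err = ss_err / (g * e * (r - 1))
    return {
        "error": ms_err,                   # error variance
        "GE": (ms_ge - ms_err) / r,        # GE interaction variance
        "G": (ms_g - ms_err) / (e * r),    # genotypic variance
        "E": (ms_e - ms_err) / (g * r),    # environmental variance
    }


# Hypothetical sums of squares for g = 5 genotypes, e = 3 environments, r = 2 replicates.
vc = variance_components(ss_g=40.0, ss_e=6.0, ss_ge=16.0, ss_err=15.0, g=5, e=3, r=2)
print(vc)
```

In this example the mean squares are 10, 3, 2, and 1, so the estimates follow by simple subtraction and division. An estimate obtained this way can turn out negative when a mean square is smaller than the error mean square; this sketch returns the raw value and leaves the handling of such cases to the analyst.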


Table 1.10 is a summary of the previous analysis. Four sources of variation are
included in observations from multi-environmental trials, i.e., genotypic effect,
environmental effect, GE interaction effect, and error effect. The degrees of freedom
are equal to $(g-1)$ for genotypes, $(e-1)$ for environments, $(g-1)(e-1)$ for GE
interactions, and $ge(r-1)$ for random errors. The total degrees of freedom are equal to
$(ger-1)$, which is the sum of the degrees of freedom of the four sources of variation.

TAB. 1.10 – ANOVA on multi-environmental phenotyping trials.


Source of        Degrees of       Sum of         Mean of        Expectation
variation        freedom (DF)     squares (SS)   squares (MS)   of MS
Genotype         $g-1$            $SS_G$         $MS_G$         $\sigma_\varepsilon^2 + er\sigma_G^2$
Environment      $e-1$            $SS_E$         $MS_E$         $\sigma_\varepsilon^2 + gr\sigma_E^2$
GE interaction   $(g-1)(e-1)$     $SS_{GE}$      $MS_{GE}$      $\sigma_\varepsilon^2 + r\sigma_{GE}^2$
Random error     $ge(r-1)$        $SS_\varepsilon$   $MS_\varepsilon$   $\sigma_\varepsilon^2$

Total            $ger-1$          $SS_T$

When the complete block experimental design is used, the difference between
blocks could be significant as well in some or all environments, and therefore should
be considered as a source of variation in ANOVA. Each environment has the same
number of r blocks. The degrees of freedom of blocks are equal to $(r-1)$ for each
environment, resulting in total degrees of freedom of blocks equal to $e(r-1)$.
Considering blocks in ANOVA will not affect the degrees of freedom and sums of squares
of genotypes, environments, and GE interactions, but will affect the degrees of freedom
and sum of squares of errors. When the block is included in the linear model of
ANOVA as a source of variation, the degrees of freedom of errors will be reduced by
the degrees of freedom of blocks, and the sum of squares of error effects will be
reduced by the sum of squares of block effects. Once the error degrees of freedom and
sum of squares have been adjusted for blocks, error variance $\sigma_\varepsilon^2$, GE
interaction variance $\sigma_{GE}^2$, genotypic variance $\sigma_G^2$, and environmental
variance $\sigma_E^2$ can be calculated in the same way as shown in equations 1.33-1.36.
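The bookkeeping of degrees of freedom, with and without blocks, can be summarized in a short function. This is a minimal sketch in plain Python assuming a balanced trial in which each environment has r complete blocks; the function name is hypothetical, and the example uses g = 20, e = 4, and r = 2.

```python
def degrees_of_freedom(g, e, r, blocks=False):
    """Degrees of freedom for a balanced multi-environment trial."""
    df = {
        "genotype": g - 1,
        "environment": e - 1,
        "GE": (g - 1) * (e - 1),
        "error": g * e * (r - 1),
    }
    if blocks:
        df["block"] = e * (r - 1)       # (r - 1) per environment
        df["error"] -= df["block"]      # error df is reduced by the block df
    df["total"] = g * e * r - 1
    return df


without_blocks = degrees_of_freedom(20, 4, 2)
with_blocks = degrees_of_freedom(20, 4, 2, blocks=True)
print(without_blocks)
print(with_blocks)
```

In both cases the degrees of freedom of all sources add up to the total, ger - 1; including blocks only moves e(r - 1) degrees of freedom out of the error term.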


1.5.3 Multi-Environmental ANOVA on Rice Grain Length


Consider the same dataset as given in table 1.8 on grain length in rice, where g = 20,
e = 4, and r = 2. Genotypes, environments, GE interactions, and random errors are
included in the linear model of ANOVA; the replication or block effect was not
considered. The degrees of freedom are equal to 19 for genotypes, 3 for environments, 57
for GE interactions, and 80 for errors, with total degrees of freedom of 159, which is
one less than the total number of observations (table 1.11). Results indicate that
the significance probability is lower than 0.001 for genotypes, and therefore the
difference in grain length in the RIL population is extremely significant across
the four environments. Environmental effects on grain length are not significant at
the 0.05 probability level. However, GE interactions are highly significant at the 0.01
probability level. Genetic variance is estimated to be much larger than the two
estimated variances of random error and GE interaction (table 1.11). The significant
result on genotypic effects indicates that the population can be used for further
genetic analysis and gene mapping on grain length in rice.

TAB. 1.11 – Combined analysis of variance on grain length of 20 RILs grown in four
environments.


Source            DF    SS        MS       F-value    Significance   Variance
Genotype          19    22.8285   1.2015   196.9264   <0.001         0.1486
Environment        3     0.0437   0.0146     2.3875   0.0751         0.0004
GE interaction    57     0.7241   0.0127     2.0821   0.0013         0.0033
Random error      80     0.4881   0.0061                             0.0061
Total            159    24.0844
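The mean squares and F-values in table 1.11 follow from the degrees of freedom and sums of squares by simple division. The sketch below, in plain Python, recomputes them from the DF and SS columns of the table, assuming each F-value is the ratio of the source mean square to the error mean square; the variable names are hypothetical.

```python
# DF and SS columns of table 1.11.
table = {
    "Genotype": (19, 22.8285),
    "Environment": (3, 0.0437),
    "GE interaction": (57, 0.7241),
    "Random error": (80, 0.4881),
}

# Mean square = sum of squares / degrees of freedom.
ms = {source: ss / df for source, (df, ss) in table.items()}

# F-value = mean square of the source / mean square of random error.
f_values = {source: ms[source] / ms["Random error"]
            for source in table if source != "Random error"}

total_df = sum(df for df, _ in table.values())
total_ss = sum(ss for _, ss in table.values())

for source in table:
    f_text = f"{f_values[source]:.4f}" if source in f_values else ""
    print(f"{source:15s} MS = {ms[source]:.4f}  F = {f_text}")
print("Total DF =", total_df, " Total SS =", round(total_ss, 4))
```

The recomputed values agree with the MS and F-value columns up to the rounding of the printed sums of squares. Significance probabilities would additionally require the F distribution with the corresponding degrees of freedom, which the Python standard library does not provide.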


1.6 Estimation of Genotypic Values and the Broad-Sense 
Heritability 


In genetic populations, genotypic values (sometimes also called phenotypic means)
of individuals (or of families or lines of close relationship and similar genotypes)
are usually unknown, e.g., $\mu_i$ in equation 1.14 and $\mu_{ij}$ in equation 1.23,
but can be estimated from phenotypic observations. For example, $\mu_i$ can be
estimated from observations $y_{ik}$ by equation 1.15; $\mu_{ij}$ can be estimated from
observations $y_{ijk}$ by equation 1.24. The genotypic value is normally estimated as the
mean of a number of replicated phenotypic observations on the genotype. This is the
reason why the genotypic value $\mu_i$ or $\mu_{ij}$ is sometimes called the phenotypic
mean of the genotype. One major purpose of conducting phenotypic evaluation through
single or multi-environmental trials is to estimate genotypic values and variance
components, and then to uncover the genetic architecture of the evaluated phenotypic
trait. In quantitative genetics, total genetic variance can be further decomposed into
two components, i.e., additive and non-additive. Therefore, two kinds of heritability are
frequently used in quantitative genetics, i.e., broad-sense heritability and
narrow-sense heritability. Broad-sense heritability is the proportion of total genetic
variance to phenotypic variance; narrow-sense heritability is the proportion of
additive genetic variance to phenotypic variance. Narrow-sense heritability is
important in studying the genetic correlations between relatives and estimating
genetic gains in random mating populations (Wang, 2017; Hallauer et al., 2010;
Falconer and Mackay, 1996). This book is mostly concerned with broad-sense
heritability, unless otherwise stated.


1.6.1 Genotypic Values and Broad-Sense Heritability 
from Single Environmental Trials 


For single environmental trials, it can be seen from equation 1.15 that observation
$y_{ik}$ contains information on the mean performance of the ith genotype. Under the
assumption that error effects follow independent and identical normal distributions,
the sample mean across replicates for the ith genotype is the best linear
unbiased estimate (BLUE) for mean performance $\mu_i$, as given in equation 1.7.
Best in BLUE stands for the minimum variance among all linear unbiased functions
of the observations. The variance of the estimate is equal to the error variance divided
by the number of replicates. The BLUE of mean performance of the ith genotype and the
variance of the BLUE are given in equation 1.37.


$$\hat{\mu}_i = \frac{1}{r}\sum_{k} y_{ik} = \bar{y}_{i.}, \quad V(\hat{\mu}_i) = \frac{1}{r}\sigma_\varepsilon^2 \qquad (1.37)$$


The estimate $\hat{\mu}_i$ calculated by equation 1.37 can act as the genotypic value and
be further used for genetic mapping and other studies. It can be seen from
equation 1.37 that the more replicates there are, the smaller the variance of the
estimate is. Unbiasedness and smaller variance indicate that the estimate would be closer
to the true value which is to be estimated, and that more reliable and valuable results
would be achieved from further studies. When error variance $\sigma_\varepsilon^2$ is also
unknown, the variance of the estimate $\hat{\mu}_i$ can still be calculated by replacing
$\sigma_\varepsilon^2$ in equation 1.37 with its estimate calculated from equation 1.8,
i.e., $\hat{\sigma}_\varepsilon^2$.

In single environment trials, the phenotypic observation from one individual or
family is the sum of the overall mean, genotypic effect, and error effect, i.e.,
equation 1.38. Under the assumption that the error effects follow independent and
identical normal distributions, phenotypic variance is the sum of two components.
One component is the variance of genotypic effects and the other one is the variance
of error effects, i.e., equation 1.39. Broad-sense heritability, represented by $H^2$, is
defined by equation 1.40.


$$P = \mu + G + \varepsilon \qquad (1.38)$$

$$\sigma_P^2 = \sigma_G^2 + \sigma_\varepsilon^2 \qquad (1.39)$$

$$H^2 = \frac{\sigma_G^2}{\sigma_P^2} = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_\varepsilon^2} \qquad (1.40)$$


Therefore, heritability can be estimated by replacing the two variance components
in equation 1.40 with their estimates calculated by equations 1.20 and 1.21.
Heritability thus defined and calculated by equation 1.40 is based on single
observations. However, gene mapping is normally based on the mean value across
replications of each genotype, or the replicated mean. As shown in equation 1.37, error
variance is reduced to the proportion $\frac{1}{r}$ in the replicated mean. When
replicated means of all genotypes are considered, genetic variance stays the same as that
in equation 1.39, but the error variance becomes $\frac{1}{r}$ times the error variance
given in equation 1.39. Thus, when the replicated mean is used as the phenotype,
phenotypic variance and heritability can be calculated by equations 1.41 and 1.42,
respectively. Obviously, heritability is increased due to the reduced error variance in
replicated means.


$$\sigma_{\bar{P}}^2 = \sigma_G^2 + \frac{1}{r}\sigma_\varepsilon^2 \qquad (1.41)$$

$$H^2 = \frac{\sigma_G^2}{\sigma_{\bar{P}}^2} = \frac{\sigma_G^2}{\sigma_G^2 + \frac{1}{r}\sigma_\varepsilon^2} \qquad (1.42)$$
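The effect of replication on heritability can be seen by evaluating the ratio of genetic variance to phenotypic variance for several numbers of replicates. The sketch below is plain Python; the function name and the variance values (genetic variance 2.0, error variance 1.0) are hypothetical, and r = 1 corresponds to heritability on a single-observation basis.

```python
def heritability_single_env(var_g, var_err, r=1):
    """Broad-sense heritability when the mean of r replicates is used as phenotype."""
    return var_g / (var_g + var_err / r)


var_g, var_err = 2.0, 1.0  # hypothetical variance components
h2_by_reps = {r: heritability_single_env(var_g, var_err, r) for r in (1, 2, 3, 5, 10)}
for r, h2 in h2_by_reps.items():
    print(f"r = {r:2d}  H2 = {h2:.3f}")
```

With these values, heritability rises from about 0.667 for single observations to 0.8 for the mean of two replicates, and keeps approaching 1 as r increases, because only the error part of the phenotypic variance is divided by r.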


1.6.2 Genotypic Values and Broad-Sense Heritability 
from Multi-Environmental Trials 


For multi-environmental trials, error variance is assumed to be homogeneous across
environments, represented by $\sigma_\varepsilon^2$, which is sometimes called the combined
error variance across environments. $MS_\varepsilon$ shown in table 1.10 is an unbiased
estimate of the combined error variance $\sigma_\varepsilon^2$. From distribution
equation 1.23, it can be seen that the observation $y_{ijk}$ contains information on
$\mu_{ij}$, the mean performance of the ith genotype in the jth environment. Under the
assumption that error effects follow independent and identical normal distributions, the
mean across replicates for the ith genotype in the jth environment is the BLUE for mean
performance $\mu_{ij}$. The BLUE of $\mu_{ij}$ and the variance of the BLUE are given in
equation 1.43. The BLUE of $\bar{\mu}_{i.}$ and the variance of the BLUE are given in
equation 1.44.


$$\hat{\mu}_{ij} = \frac{1}{r}\sum_{k} y_{ijk} = \bar{y}_{ij.}, \quad V(\hat{\mu}_{ij}) = \frac{1}{r}\sigma_\varepsilon^2 \qquad (1.43)$$

$$\hat{\bar{\mu}}_{i.} = \frac{1}{er}\sum_{j,k} y_{ijk} = \bar{y}_{i..}, \quad V(\hat{\bar{\mu}}_{i.}) = \frac{1}{er}\sigma_\varepsilon^2 \qquad (1.44)$$

In multi-environment trials, the phenotypic observation from one individual or
family is the sum of the overall mean, genotypic effect, environmental effect, GE
interaction, and error effect, i.e., equation 1.45. Under the assumption that error
effects follow independent and identical normal distributions, phenotypic variance is
the sum of four components. As given in equation 1.46, the four components
are the variance of genotypic effects, the variance of environmental effects, the
variance of GE interaction effects, and the variance of error effects. In equation 1.46,
environmental variance $\sigma_E^2$ is ascribed to non-inherited factors, which should
not be considered in the estimation of heritability. Broad-sense heritability $H^2$ is
therefore defined by equation 1.47.


$$P = \mu + G + E + GE + \varepsilon \qquad (1.45)$$

$$\sigma_P^2 = \sigma_G^2 + \sigma_E^2 + \sigma_{GE}^2 + \sigma_\varepsilon^2 \qquad (1.46)$$

$$H^2 = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_{GE}^2 + \sigma_\varepsilon^2} \qquad (1.47)$$


Therefore, heritability can be estimated by replacing the variance components
in equation 1.47 with their estimates calculated by equations 1.33-1.35.
Heritability thus defined and calculated by equation 1.47 is based on single
observations. However, gene mapping can be based on the mean value across
replications and environments of each genotype, calculated by equation 1.44. As
far as the mean values of genotypes are concerned, genetic variance stays the same
as that in equation 1.47, but GE interaction variance becomes $\frac{1}{e}$ times the
interaction variance given in equation 1.47, and error variance becomes $\frac{1}{er}$
times the error variance given in equation 1.47. Thus, when the mean across environments
and replications is used as the phenotype, phenotypic variance and heritability can be
calculated by equations 1.48 and 1.49, respectively. Obviously, heritability is
increased due to the reduced GE interaction variance and error variance in
replicated means. GE interactions increase phenotypic variance and therefore
reduce heritability.


$$\sigma_{\bar{P}}^2 = \sigma_G^2 + \frac{1}{e}\sigma_{GE}^2 + \frac{1}{er}\sigma_\varepsilon^2 \qquad (1.48)$$

$$H^2 = \frac{\sigma_G^2}{\sigma_{\bar{P}}^2} = \frac{\sigma_G^2}{\sigma_G^2 + \frac{1}{e}\sigma_{GE}^2 + \frac{1}{er}\sigma_\varepsilon^2} \qquad (1.49)$$
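The two versions of heritability in multi-environmental trials can be compared numerically. The sketch below, in plain Python, takes the estimated variances printed in the last column of table 1.11 (genotype 0.1486, GE interaction 0.0033, random error 0.0061) with e = 4 and r = 2 as inputs; the function name is hypothetical, and environmental variance is left out of the denominator as described above.

```python
def heritability_multi_env(var_g, var_ge, var_err, e=1, r=1):
    """Broad-sense heritability for means across e environments and r replicates.

    With e = 1 and r = 1 this is the heritability on a single-observation basis.
    """
    return var_g / (var_g + var_ge / e + var_err / (e * r))


# Estimated variances as printed in table 1.11.
var_g, var_ge, var_err = 0.1486, 0.0033, 0.0061

h2_single = heritability_multi_env(var_g, var_ge, var_err)
h2_mean = heritability_multi_env(var_g, var_ge, var_err, e=4, r=2)
print(f"single observation: {h2_single:.4f}")
print(f"mean across environments and replicates: {h2_mean:.4f}")
```

Both values are high for this trait because genetic variance dominates; averaging over four environments and two replicates raises heritability further by shrinking the GE interaction and error terms.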


1.6.3 Estimation of Genotypic Values Under 
Heterogeneous Error Variances 


In one specific environment, error effects can be assumed to follow a normal
distribution with a mean of 0 and the same variance for all genotypes. However, error
variances may differ among environments due to the varied environmental conditions and
planting systems. For example, under elite growing conditions, error effects may be
smaller due to the uniformity in irrigation, soil fertility, pest control, and agronomic
management in the field, and therefore observations are close to the true
genotypic values. On the contrary, under drought and rainfed conditions, error effects
may be much larger due to less uniform growth factors, and therefore observations
deviate more from the true genotypic values. Intuitively, a well-controlled
environment is expected to have a smaller error variance than a less-controlled
environment. When error variances differ among the tested environments, the error effects
are called heterogeneous.

Assume one genotype has mean performance $\mu$, which is evaluated in e
heterogeneous environments. In the jth environment, error variance is represented
by $\sigma_{\varepsilon j}^2$, and one single observation is represented by $y_j$. For
replicated observations, the replicated mean is used as the phenotypic observation $y_j$.
The linear model of the observation is given below. Error effects are still independent
and have a mean of 0, but they do not follow an identical normal distribution.


$$y_j = \mu + \varepsilon_j, \text{ where } \varepsilon_j \sim N(0, \sigma_{\varepsilon j}^2) \ (j = 1, 2, ..., e) \text{ are independent}$$


Under the condition of heterogeneous error variances, the sample mean
$\bar{y} = \frac{1}{e}\sum_{j} y_j$ is still an unbiased estimate of the genotypic mean
$\mu$, but is no longer the best estimate with the minimum variance. In other words, there
exists another unbiased linear estimate which has a smaller variance than the sample mean
$\bar{y}$. Before moving further on this topic, Bartlett's test on the homogeneity of
error variances is introduced. Let $\hat{\sigma}_{\varepsilon j}^2$ and $df_j$ be the
estimated error variance and its degrees of freedom, respectively, in the jth
environment. The null ($H_0$) and alternative ($H_A$) hypotheses in the test are listed as
follows.


$$H_0: \sigma_{\varepsilon 1}^2 = \sigma_{\varepsilon 2}^2 = \cdots = \sigma_{\varepsilon e}^2$$

$$H_A: \text{at least two among } \sigma_{\varepsilon 1}^2, \sigma_{\varepsilon 2}^2, ..., \sigma_{\varepsilon e}^2 \text{ are not equal}$$


Under the null hypothesis $H_0$, the combined error variance $\sigma_\varepsilon^2$ can be
calculated from the estimates in the e environments, that is,


$$\hat{\sigma}_\varepsilon^2 = \frac{1}{\sum_{j} df_j}\sum_{j} df_j \times \hat{\sigma}_{\varepsilon j}^2$$


The statistic of Bartlett's test, given below, asymptotically follows a chi-square
distribution with e - 1 degrees of freedom and can be used to test the significance of
the null hypothesis.


$$\chi^2 = \left(\sum_{j} df_j\right)\ln(\hat{\sigma}_\varepsilon^2) - \sum_{j} df_j \times \ln(\hat{\sigma}_{\varepsilon j}^2) \sim \chi^2(e-1)$$


For the estimated error variances in four environments, as shown in table 1.9, it
can be found that $\chi^2$ = 1.188 (df = 3) and P = 0.756. Therefore, error variances can
be assumed to be homogeneous in the four tested environments.
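The combined error variance and the test statistic can be computed in a few lines. The following is a minimal sketch in plain Python that follows the form of the statistic as printed above, without any further correction factor; the function name, the four error variances, and their degrees of freedom are hypothetical and are not the values of table 1.9.

```python
import math


def bartlett_statistic(variances, dfs):
    """Return the combined (pooled) error variance and the test statistic."""
    total_df = sum(dfs)
    pooled = sum(df * v for df, v in zip(dfs, variances)) / total_df
    stat = total_df * math.log(pooled) - sum(df * math.log(v)
                                             for df, v in zip(dfs, variances))
    return pooled, stat


# Hypothetical estimates of error variance in four environments, each with 20 df.
variances = [0.004, 0.006, 0.005, 0.009]
dfs = [20, 20, 20, 20]
pooled, stat = bartlett_statistic(variances, dfs)
print(f"combined error variance = {pooled:.4f}, statistic = {stat:.4f}, df = {len(dfs) - 1}")

# Identical variances give a statistic of zero.
pooled_equal, stat_equal = bartlett_statistic([0.005] * 4, dfs)
print(pooled_equal, stat_equal)
```

The statistic is then compared with a chi-square distribution with e - 1 degrees of freedom. The Python standard library has no chi-square distribution function, so the significance probability has to come from a statistical table or a statistics package.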

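By constructing linear unbiased functions of $y_j$ (j = 1, 2, ..., e) and then
calculating their variances, the linear function with minimum variance can be identified
and is given in equation 1.50 (see also exercise 1.11 for e = 2). The variance of the
BLUE is given in equation 1.51.


$$\hat{\mu} = \sum_{j} w_j y_j, \text{ where } w_j = \frac{\frac{1}{\sigma_{\varepsilon j}^2}}{\frac{1}{\sigma_{\varepsilon 1}^2} + \frac{1}{\sigma_{\varepsilon 2}^2} + \cdots + \frac{1}{\sigma_{\varepsilon e}^2}} \qquad (1.50)$$

$$V(\hat{\mu}) = \frac{1}{\frac{1}{\sigma_{\varepsilon 1}^2} + \frac{1}{\sigma_{\varepsilon 2}^2} + \cdots + \frac{1}{\sigma_{\varepsilon e}^2}} \qquad (1.51)$$


The variance of the sample mean, i.e., $V(\bar{y})$, is given in equation 1.52. It can be
shown that $V(\hat{\mu})$ defined in equation 1.51 is equal to $V(\bar{y})$ defined in
equation 1.52 only under the condition of homogeneous error variances, i.e., when the
null hypothesis is true in Bartlett's test.


$$V(\bar{y}) = \frac{1}{e^2}\left(\sigma_{\varepsilon 1}^2 + \sigma_{\varepsilon 2}^2 + \cdots + \sigma_{\varepsilon e}^2\right) \qquad (1.52)$$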
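The weighted mean and the two variances can be compared on a small example. The sketch below, in plain Python, assumes the error variance of each environment is known and uses weights proportional to the inverse of each error variance; the function names, the two observations, and the two error variances are hypothetical.

```python
def blue_weighted_mean(y, variances):
    """Inverse-variance weighted mean, its weights, and its variance."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [x / total for x in inverse]
    estimate = sum(w * obs for w, obs in zip(weights, y))
    return estimate, weights, 1.0 / total


def simple_mean_variance(variances):
    """Variance of the unweighted mean of independent observations."""
    e = len(variances)
    return sum(variances) / e ** 2


# Hypothetical observations of one genotype in two environments.
y = [5.2, 5.6]
variances = [0.01, 0.04]   # environment 1 is measured four times more precisely

estimate, weights, var_blue = blue_weighted_mean(y, variances)
var_mean = simple_mean_variance(variances)
print("weights:", weights)
print("weighted mean:", estimate, " simple mean:", sum(y) / len(y))
print("V(weighted mean):", var_blue, " V(simple mean):", var_mean)
```

Here the more precise environment receives a weight of 0.8 and the weighted mean has a variance of 0.008, compared with 0.0125 for the simple mean. With equal error variances the weights all become 1/e and the two estimates coincide. In practice the error variances are themselves estimated, so the weights are only approximately optimal.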
From equation 1.50, it can be seen that the environment with a smaller error
variance has greater weight in the estimate. To further illustrate the
advantages of the weighted mean, e = 2 is considered. Assume
$\sigma_{\varepsilon 2}^2 = s\sigma_{\varepsilon 1}^2$, where s is the ratio of the two
error variances. Variances of the weighted and unweighted means, together with the ratio
of the two variances, are given below.


$$V(\hat{\mu}) = \frac{s}{1+s}\sigma_{\varepsilon 1}^2, \quad V(\bar{y}) = \frac{1}{4}(1+s)\sigma_{\varepsilon 1}^2, \quad \frac{V(\hat{\mu})}{V(\bar{y})} = \frac{4s}{(1+s)^2}$$
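The intermediate steps behind the two variances and their ratio for e = 2 can be written out as follows, assuming only equations 1.51 and 1.52 and the substitution of s times the first error variance for the second error variance.

```latex
\begin{aligned}
V(\hat{\mu}) &= \frac{1}{\dfrac{1}{\sigma_{\varepsilon 1}^2} + \dfrac{1}{s\,\sigma_{\varepsilon 1}^2}}
              = \frac{s\,\sigma_{\varepsilon 1}^2}{s + 1}, \\
V(\bar{y})   &= \frac{1}{2^2}\left(\sigma_{\varepsilon 1}^2 + s\,\sigma_{\varepsilon 1}^2\right)
              = \frac{1+s}{4}\,\sigma_{\varepsilon 1}^2, \\
\frac{V(\hat{\mu})}{V(\bar{y})} &= \frac{s}{1+s}\cdot\frac{4}{1+s} = \frac{4s}{(1+s)^2}, \\
(1+s)^2 - 4s &= (1-s)^2 \ge 0
  \;\Longrightarrow\; \frac{4s}{(1+s)^2} \le 1, \text{ with equality only when } s = 1 .
\end{aligned}
```

The last line shows why the weighted mean can never have a larger variance than the simple mean, and why the two variances coincide only for equal error variances.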


Figure 1.12 shows the ratio of the variances of the two estimates against the ratio of
the two error variances. The variances of the two estimates are equal only when the two
error variances are equal. For any unequal error variances, the variance of the BLUE, or
the weighted mean, is always lower than the variance of the simple sample mean. The case
$\sigma_{\varepsilon 1}^2 \ne 0$ and $\sigma_{\varepsilon 2}^2 = 0$ (i.e., s = 0) represents
an extreme situation where no error is included in the observation in environment 2. In
other words, the observation in environment 2 is equal to the true value of the genotype,
i.e., $y_2 = \mu$. However, either a positive or a negative error effect is included in
the observation in environment 1, i.e., $y_1 \ne \mu$. Therefore, the sample mean
$\frac{1}{2}(y_1 + y_2) \ne \mu$. In this extreme situation, the observation in
environment 2 is obviously the best estimate, i.e., $y_1$ has a weight of 0, and $y_2$ has
a weight of 1. Any weight given to $y_1$ in the estimate will move the estimate away from
the true value of mean performance $\mu$. The case $\sigma_{\varepsilon 1}^2 \ne 0$ and
$\sigma_{\varepsilon 2}^2 = \infty$ represents the other extreme situation, where the
observation in environment 2 does not contain any information about mean performance
$\mu$. In this situation, the observation in environment 1 is obviously the best estimate,
i.e., $y_1$ has a weight of 1, and $y_2$ has a weight of 0. Any weight given to $y_2$ in
the estimate will increase the deviation of the estimate from the true value of mean
performance $\mu$.


[Figure 1.12: vertical axis, ratio of variances from two estimates; horizontal axis,
ratio of error variances in two environments]


FIG. 1.12 – Ratio of two variances from the best linear unbiased estimate (BLUE) and the
simple mean.

Using the grain length dataset in table 1.8, replicated means of the two parents and
their 20 RILs are given in the first part of table 1.12. Given in the last two columns of
table 1.12 are the simple sample mean and the weighted mean across the four environments.
In this multi-environmental trial, error variances were previously tested to be
homogeneous across environments. Therefore, the difference between the last two columns
is minor. In the following genetic mapping studies, replicated means in each
environment, i.e., columns 2-5, should be used if the major objective is to understand the
effect of grain length genes in different environments. Estimates across environments,
i.e., columns 6 or 7, should be used if the major objective is to identify grain length
genes with stable effects across environments, and GE interaction is not the issue.


TAB. 1.12 — Estimation of genotypic values on grain length (mm) of two rice cultivars (i.e., 
Asominori and IR24) and their 20 RILs (i.e., RIL1-RIL20) in four locations (i.e., E1-E4). 


Genotype Replicated mean Simple mean Weighted mean
E1 E2 E3 E4

Asominori 5.215 5.190 5.190 5.270 5.216 5.218 
IR24 6.070 6.125 6.135 6.105 6.109 6.105 
RIL1 5.425 5.470 5.445 5.505 5.461 5.460 
RIL2 6.335 6.195 6.255 6.410 6.299 6.305 
RIL3 5.325 5.315 5.220 5.360 5.305 5.311 
RIL4 5.430 5.410 5.440 5.345 5.406 5.405 
RIL5 5.375 5.360 5.365 5.305 5.351 5.352 
RIL6 5.420 5.435 5.405 5.365 5.406 5.407 
RIL7 5.450 5.440 5.415 5.440 5.436 5.438 
RIL8 5.650 5.650 5.650 5.695 5.661 5.662 
RIL9 5.160 5.180 5.195 5.110 5.161 5.159 
RIL10 5.195 5.215 5.260 5.390 5.265 5.262 
RIL11 4.890 4.800 4.815 5.385 4.973 4.981
RIL12 5.685 5.705 5.675 5.690 5.689 5.689 
RIL13 4.955 4.940 5.035 4.885 4.954 4.949 
RIL14 5.855 5.865 5.870 5.970 5.890 5.890 
RIL15 5.485 5.515 5.530 5.550 5.520 5.517 
RIL16 4.825 4.815 4.760 4.905 4.826 4.831 
RIL17 5.620 5.635 5.600 5.475 5.583 5.583 
RIL18 6.000 6.010 5.990 5.860 5.965 5.965 
RIL19 4.965 5.005 5.000 5.055 5.006 5.004 
RIL20 6.005 6.010 6.055 6.060 6.033 6.030 
Exercises 


1.1 Assume 9 homozygous parents are used to conduct the genetic mating design.
How many single crosses are there for each of the four designs, i.e., complete diallel,
partial diallel (4 used as the female parents, and 5 used as male parents), single
chain, and double chain?


1.2 In figure 1.7, marker type code 2 represents genotype AA of parent Harrington,
code 0 represents genotype aa of parent TR306, and code -1 represents missing values.
Using the data given at locus Aga7, calculate sample sizes of different marker types,
calculate gene and genotypic frequencies, and conduct the 1:1 segregation ratio test on
two homozygous genotypes in the population.


1.3 In one rice $F_2$ population with a size of 180 individuals, the observed numbers of
three marker types (coded as 2, 1, and 0) and missing values (coded as -1) are given
below for 9 molecular markers.


Marker locus    Three marker types and missing value
                  2      1      0     -1

RM6_2            29     84     55     12
RM6_7            33     85     54      8
RM6_13           21     84     68      7
RM6_17           20     85     65     10
RM6_19           20     83     66     11
RM6_350          31     95     52      2
RM6_33           34     91     53      2
RM6_34           34     90     48      8
RM6_42           39    100     41      0


(1) Calculate gene and genotypic frequencies at each marker locus. 
(2) Conduct the 1:2:1 segregation ratio test on three marker types at each locus. 


1.4 In figure 1.8, assume the genotypes of two parents are AA and aa, and their $F_1$
genotype is Aa. In their $F_2$ population, A stands for genotype AA, B for genotype aa,
H for genotype Aa, D for mixture genotype Aa+AA, R for mixture genotype Aa+aa,
and X for missing.


1) For two markers M1-1 and M1-2, count the sample sizes of different marker 
types, calculate gene and genotypic frequencies, and conduct the 1:2:1 segre- 
gation ratio test for each locus. 

2) For two markers M1-3 and M1-6, count the sample sizes of different marker 
types and conduct the 3:1 segregation ratio test for each locus. 

3) Use a contingency table to test the independence between M1-1 and M1-2 (here,
independence means the unlinked genetic relationship). 

4) Use a contingency table to test the independence between M1-1 and M1-3. 

5) Use a contingency table to test the independence between M1-1 and M1-6. 

6) Use a contingency table to test the independence between M1-3 and M1-6. 


1.5 Family structure is not considered. For populations given in figure 1.1, show
that at each co-dominant locus the $F_3$ population follows the 3:2:3 theoretical
segregation ratio, and the $P_1BC_2F_2$ population follows the 13:2:1 theoretical
segregation ratio.


1.6 Given below are sample means and variances on one trait in two parents $P_1$ and
$P_2$ together with their $F_1$ and $F_2$ populations. Each population has 100 observations.


Population    Sample size    Sample mean    Sample variance
P1            100            69.44          59.73
P2            100            59.04          65.71
F1            100            83.44          51.81
F2            100            74.36          100.75


1) Test the homogeneity of three error variances estimated in two parental and $F_1$
populations, and calculate the combined error variance.

2) Based on the combined error variance, calculate genetic variance and
broad-sense heritability in the $F_2$ population.

3) Under the single-locus model, calculate the additive effect, dominant effect, and
degree of dominance using the estimated means of two parental and $F_2$
populations.

4) Under the single-locus model, calculate the mid-parental value, additive effect,
and dominant effect from the estimated means of two parental, $F_1$, and $F_2$
populations by the least square estimation method. What is the estimated
degree of dominance at the locus?


1.7 The following table gives the frequency distribution of maize ear length in two
parents, and their $F_1$ and $F_2$ populations, which is adapted from the hybridization
experiment in East (1911). Assume parent $P_1$ contains all alleles that reduce ear
length, parent $P_2$ contains all alleles that increase ear length, and inheritance of ear
length can be fitted by the multi-factorial hypothesis, i.e., ear length is controlled by
a number of independently inherited genes with equal additive effects. Try to estimate
the number of pairs of genes on ear length and the average additive effect at
each locus.


Mid-group value of ear length (cm) in maize


Population     5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
P1             4  21  24   8
F1                             1  12  12  14  17   9   4
P2                                             3  11  12  15  26  15  10   7   2
F2                     4   5  22  56  80 145 129  91  63  27  17   6   1


1.8 The following table gives the frequency distribution of corolla length in tobacco
in two parents, and their $F_1$ and $F_2$ populations, which is adapted from the
hybridization experiment in East (1916). Assume parent $P_1$ contains all alleles that
reduce corolla length, parent $P_2$ contains all alleles that increase corolla length, and
inheritance of corolla length can be fitted by the multi-factorial hypothesis, i.e.,
corolla length is controlled by a number of independently inherited genes with equal
additive effects. Try to estimate the number of pairs of genes on corolla length and
the average additive effect at each locus.


Mid-group value of corolla length (mm) in tobacco


56 58 61 64 67 70 73 76 79


82


100


1.9 Assume allele A is dominant to allele a. In one $F_2$ population derived from two
parents having genotypes AA and aa, one individual showing the dominant phenotype
could have genotype AA or genotype Aa. In genetics, the genotype of one
dominant individual can be further determined by the selfed $F_3$ family. The dominant
individual is assumed to have genotype AA when no phenotypic segregation is
observed in its selfed $F_3$ family; otherwise, the dominant individual is assumed to
have genotype Aa. Assuming the selfed $F_3$ family only has 5 individuals, how large is
the probability of misclassifying the true genotype Aa as AA? To keep the
misclassification probability below 0.05, how many $F_3$ individuals are required? To keep
the misclassification probability below 0.01, how many $F_3$ individuals are required?


1.10 Let $f_{AA}$, $f_{Aa}$, and $f_{aa}$ be the frequencies of the three genotypes AA, Aa,
and aa in a population. Under the single-locus model, show the following equation for
the genetic variance $\sigma_G^2$.

$$\sigma_G^2 = \left[f_{AA} + f_{aa} - (f_{AA} - f_{aa})^2\right]a^2 - 2f_{Aa}(f_{AA} - f_{aa})ad + \left(f_{Aa} - f_{Aa}^2\right)d^2$$


1.11 Phenotypic observations of one genotype in two environments are represented
by random variables $y_1$ and $y_2$, which independently follow two normal distributions
with the same mean $\mu$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. An unbiased
linear function of the two random variables is given by $z = b_1 y_1 + b_2 y_2$, where $b_1$
and $b_2$ are constants under the constraint $b_1 + b_2 = 1$. Show that the unbiased linear
combination

$$z = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\, y_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\, y_2$$

has the minimum variance among all unbiased linear combinations.


1.12 Given below are maize silking days from emergence, with three replications, for
20 inbred lines grown in drought and well-watered conditions.


Maize inbred line    Drought environment         Well-watered environment
                     Rep. 1   Rep. 2   Rep. 3    Rep. 1   Rep. 2   Rep. 3
 1                   101      90       91        89       89       93
 2                   82       85       87        84       84       83
 3                   86       85       83        80       84       88
 4                   85       85       87        83       84       83
 5                   80       82       81        81       82       83
 6                   95       98       95        89       94       95
 7                   84       85       85        81       84       85
 8                   86       85       87        84       85       83
 9                   87       89       91        85       88       87
10                   84       85       89        82       85       85
11                   82       85       83        81       85       82
12                   83       85       87        83       87       83
13                   89       87       94        89       92       88
14                   90       92       93        89       91       90
15                   95       89       95        89       90       91
16                   82       85       87        84       85       87
17                   91       95       92        89       94       91
18                   88       90       89        89       91       88
19                   84       85       87        82       85       85
20                   88       92       101       89       90       93

(1) Conduct ANOVA and estimate the broad-sense heritability of maize silking day in
each of the two growing conditions. Replication is not considered in the linear
model of ANOVA.

(2) Test the homogeneity of the error variances estimated in the two growing conditions.

(3) Calculate the weighted mean or BLUE of the silking day for each inbred line.

Chapter 2 


Estimation of the Two-Point 
Recombination Frequencies 


Recombination frequency is defined as the probability that crossing-over happens
between two homologous chromosomes during one generation of meiosis. One
crossing-over represents one exchange event between two homologous chromosomes
when they are paired. Along one complete chromosome, crossing-over may happen at
several locations simultaneously. Strictly speaking, the recombination frequency
between two linked loci on a chromosome is the probability that an odd number of
crossing-overs happen between them. For two closely linked loci, the chance of two or
more crossing-overs is extremely low due to the short chromosomal interval. Therefore,
in two-locus (more often called two-point) genetic and linkage analysis, only the cases
of no crossing-over and one crossing-over are considered for a specific chromosomal
region. Under normal conditions, the probability of recombination is higher in a longer
chromosomal region and lower in a shorter one, so the recombination frequency can be
used to measure the length of a chromosomal region or, equivalently, the genetic
distance between the two loci at either end of the region. Testing the linkage
relationship between two loci and estimating their recombination frequency
are classical tasks in genetic studies (Hartl and Jones, 2005; Bailey, 1961;
Kempthorne, 1957), and they are fundamental to linkage map construction and gene
mapping. In this chapter, the estimation of two-point recombination frequency will
be introduced through the linkage analysis in various bi-parental populations.


2.1 Generation Transition Matrix 


2.1.1 Usefulness of the Transition Matrix in Linkage 
Analysis 
Let A/a and B/b be the alleles at two loci that are linked on one chromosome, and let
AABB (sometimes denoted as AB/AB) and aabb (sometimes denoted as ab/ab) be the
genotypes of two homozygous parents, where the slash symbol separates two homologous
chromosomes. There are three possible genotypes at each locus and nine possible
genotypes when the two loci are considered together. Given the recombination
frequency, the nine genotypes have particular frequencies in a particular population,
which are called the expected or theoretical frequencies. These theoretical
frequencies are functions of the recombination frequency and provide the basic
information to estimate it from the observed numbers of the various
genotypes in the population. As indicated in figure 1.1, some populations are
developed from the F1 hybrid after one generation of propagation, such as one
generation of backcrossing, F1DH, and F2. There is only one genotype present in the
F1 population, i.e., AaBb (or AB/ab), and the genotypic frequencies in these
populations can be easily derived from the frequencies of gametes generated by AB/ab.
Some populations are developed by several generations of backcrossing and selfing,
such as BC1F2, BC2F2, F3, and RIL. For these populations, theoretical genotypic
frequencies cannot be easily derived, and the generation transition matrix has to be adopted.

When heterozygotes are present in a bi-parental population, the double
heterozygous genotype, i.e., AaBb (heterozygous at both loci), comes from two
different linkage phases represented by genotype AB/ab and genotype Ab/aB, where
the slash symbol separates two homologous chromosomes. The two haplotypes in genotype
AB/ab, i.e., AB and ab, are the two parental types, and the allelic phase at the two
loci is sometimes called the coupling linkage or linkage in coupling. In comparison,
the two haplotypes in genotype Ab/aB, i.e., Ab and aB, are the two recombinant types,
and the allelic phase is called the repulsive linkage or linkage in repulsion. Coupling
and repulsive linkage phases in double heterozygotes cannot be separated without
additional information. The two phases have to be treated as one unified genotype,
i.e., AaBb, when the numbers of different genotypes in a population are counted to
estimate the recombination frequency. However, gametes generated by the two
linkage phases have different frequencies. To properly derive the genotypic
frequencies after propagation, frequencies of the two phases have to be considered separately.

To acquire genotypic frequencies in various bi-parental populations starting from
the F1 generation, 10 genotypes are temporarily considered at two loci, denoted by
types 1-10. The order of these genotypes can be identified from the following vector
of theoretical genotypic frequencies at generation $t$, i.e., $\mathbf{p}^{(t)}$.

$$\mathbf{p}^{(t)} = \begin{bmatrix} p_{AABB}^{(t)} & p_{AABb}^{(t)} & p_{AAbb}^{(t)} & p_{AaBB}^{(t)} & p_{AB/ab}^{(t)} & p_{Ab/aB}^{(t)} & p_{Aabb}^{(t)} & p_{aaBB}^{(t)} & p_{aaBb}^{(t)} & p_{aabb}^{(t)} \end{bmatrix}$$


Individuals in a practical limited-size population can be treated as samples from
an infinite population having the theoretical genotypic frequencies given above.
Observed numbers of individuals having the 10 genotypes follow a multinomial
distribution with the 10 theoretical frequencies. All possible genotypes are included
in the 10 types, and therefore the sum of the 10 frequencies is equal to 1. Vector
$\mathbf{p}^{(t)}$ containing frequencies of the 10 types is called a probability vector.
For convenience, selfing, repeated selfing, backcrossing, and doubled haploid are all
called mating systems in the general sense. After mating, the population is advanced
from generation $t$ to generation $t + 1$. In the meantime, genotypic frequencies also
change in the population of generation $t + 1$, which is represented by probability
vector $\mathbf{p}^{(t+1)}$, i.e.,

$$\mathbf{p}^{(t+1)} = \begin{bmatrix} p_{AABB}^{(t+1)} & p_{AABb}^{(t+1)} & p_{AAbb}^{(t+1)} & p_{AaBB}^{(t+1)} & p_{AB/ab}^{(t+1)} & p_{Ab/aB}^{(t+1)} & p_{Aabb}^{(t+1)} & p_{aaBB}^{(t+1)} & p_{aaBb}^{(t+1)} & p_{aabb}^{(t+1)} \end{bmatrix}$$


Genotypic frequencies in generation $t + 1$ depend only on frequencies in
generation $t$, regardless of frequencies in earlier generations. By taking the observed
numbers of genotypes as random variables having a multinomial distribution, these
variables in successive generations form a Markov chain. Let $\mathbf{T}$ be the transition
matrix from generation $t$ to $t + 1$, which depends on the mating system used in
propagation. Each row in matrix $\mathbf{T}$ represents the frequencies of the progeny
genotypes generated by one genotype in generation $t$. The sum of the elements in
each row is equal to 1, and therefore the generation transition matrix $\mathbf{T}$ is called
the probability transition matrix in probability and statistics. After one generation of
mating, genotypic frequency vector $\mathbf{p}^{(t+1)}$ can be expressed as the product of
the probability vector before mating and the transition matrix, i.e., equation 2.1.

$$\mathbf{p}^{(t+1)} = \mathbf{p}^{(t)} \mathbf{T} \tag{2.1}$$


Thus, if the generation transition matrix for each mating system can be identified,
genotypic frequencies in most bi-parental populations can be calculated. Given
below are transition matrices for backcrossing with parent P1, backcrossing with
parent P2, one generation of selfing, doubled haploid, and repeated selfing, denoted
by $\mathbf{T}_{P1B}$, $\mathbf{T}_{P2B}$, $\mathbf{T}_S$, $\mathbf{T}_D$, and $\mathbf{T}_R$, respectively.


2.1.2 Transition Matrix of One Generation 
of Backcrossing 


Backcrossing with parent P1 is taken as an example to illustrate in detail the
backcrossing transition matrix given in equation 2.2. Based on previous discussions,
the 10 genotypes are considered separately.


(1) Genotype 1 (AABB). When individuals with genotype AABB are backcrossed
with parent P1 (AABB), all progenies have the same genotype,
i.e., type 1 (AABB). Therefore, the first element in row 1 of transition matrix
$\mathbf{T}_{P1B}$ is equal to 1, and the other elements in row 1 are equal to 0 (equation 2.2).

(2) Genotype 2 (AABb). When type 2 individuals are backcrossed with parent P1
(AABB), their progenies have two genotypes, i.e., type 1 (AABB) and type 2
(AABb). The two types have an equal frequency in the progeny generation.
Therefore, the first and second elements in row 2 of transition matrix $\mathbf{T}_{P1B}$ are
both equal to $\frac{1}{2}$, and the other elements in row 2 are equal to 0 (equation 2.2).

(3) Genotype 3 (AAbb). When type 3 individuals are backcrossed with parent P1
(AABB), all progenies have the same genotype, i.e., type 2
(AABb). Therefore, the second element in row 3 of transition matrix $\mathbf{T}_{P1B}$ is
equal to 1, and the other elements in row 3 are equal to 0 (equation 2.2).

(4) Genotype 4 (AaBB). When type 4 individuals are backcrossed with parent P1
(AABB), their progenies have two genotypes, i.e., type 1 (AABB) and type 4
(AaBB). The two types have an equal frequency in the progeny generation.
Therefore, the first and fourth elements in row 4 of transition matrix $\mathbf{T}_{P1B}$ are
both equal to $\frac{1}{2}$, and the other elements in row 4 are equal to 0 (equation 2.2).

(5) Genotype 5 (AB/ab). When individuals with genotype AB/ab are backcrossed
with parent P1 (AABB), their progenies have four genotypes, i.e., type 1
(AABB), type 2 (AABb), type 4 (AaBB), and type 5 (AB/ab). Types 1 and 5
come from the combination of non-recombinant gametes generated by AB/ab
and the only gamete AB generated by parent P1. The two non-recombinant
gametes generated by AB/ab, i.e., AB and ab, have an equal frequency $\frac{1}{2}(1 - r)$.
Types 2 and 4 come from the combination of recombinant gametes generated
by AB/ab and the only gamete AB generated by parent P1. The two recombinant
gametes generated by AB/ab, i.e., Ab and aB, have an equal frequency $\frac{1}{2}r$.
Therefore, the first, second, fourth, and fifth elements in row 5 of transition
matrix $\mathbf{T}_{P1B}$ are equal to $\frac{1}{2}(1 - r)$, $\frac{1}{2}r$, $\frac{1}{2}r$, and $\frac{1}{2}(1 - r)$, respectively, and the
other elements in row 5 are equal to 0 (equation 2.2).

(6) Genotype 6 (Ab/aB). In comparison with genotype 5, gametes Ab and aB
generated by Ab/aB become non-recombinant types, each with frequency
$\frac{1}{2}(1 - r)$; AB and ab are recombinant types, each with frequency $\frac{1}{2}r$. Therefore,
the first, second, fourth, and fifth elements in row 6 of transition matrix $\mathbf{T}_{P1B}$
are equal to $\frac{1}{2}r$, $\frac{1}{2}(1 - r)$, $\frac{1}{2}(1 - r)$, and $\frac{1}{2}r$, respectively, and the other elements
in row 6 are equal to 0 (equation 2.2).

(7) Genotype 7 (Aabb). When type 7 individuals are backcrossed with parent P1
(AABB), their progenies have two genotypes, i.e., type 2 (AABb) and type 5
(AB/ab). The two types have an equal frequency in the progeny generation.
Therefore, the second and fifth elements in row 7 of transition matrix $\mathbf{T}_{P1B}$ are
both equal to $\frac{1}{2}$, and the other elements in row 7 are equal to 0 (equation 2.2).

(8) Genotype 8 (aaBB). When type 8 individuals are backcrossed with parent P1
(AABB), all progenies have the same genotype, i.e., type 4
(AaBB). Therefore, the fourth element in row 8 of transition matrix $\mathbf{T}_{P1B}$ is
equal to 1, and the other elements in row 8 are equal to 0 (equation 2.2).

(9) Genotype 9 (aaBb). When type 9 individuals are backcrossed with parent P1
(AABB), their progenies have two genotypes, i.e., type 4 (AaBB) and type 5
(AB/ab). The two types have an equal frequency in the progeny generation.
Therefore, the fourth and fifth elements in row 9 of transition matrix $\mathbf{T}_{P1B}$ are
both equal to $\frac{1}{2}$, and the other elements in row 9 are equal to 0 (equation 2.2).

(10) Genotype 10 (aabb). When type 10 individuals are backcrossed with parent P1
(AABB), all progenies have the same genotype, i.e., type 5
(AB/ab). Therefore, the fifth element in row 10 of transition matrix $\mathbf{T}_{P1B}$ is
equal to 1, and the other elements in row 10 are equal to 0 (equation 2.2).


By investigating possible genotypes and their frequencies in the backcrossing
progeny of each of the 10 genotypes, the transition matrix of backcrossing can be
determined. As given in equation 2.2, $\mathbf{T}_{P1B}$ is a 10 × 10 square matrix. Each
genotype takes one row to specify the genotypic frequencies in its backcrossing
progeny. When backcrossed with parent P1 (AABB), type 3 (AAbb), type 6 (Ab/aB),
type 7 (Aabb), type 8 (aaBB), type 9 (aaBb), and type 10 (aabb) are all absent from
the progeny, so the elements in columns 3 and 6-10 are all equal to 0.

$$\mathbf{T}_{P1B} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2} & 0 & 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2}(1-r) & \frac{1}{2}r & 0 & \frac{1}{2}r & \frac{1}{2}(1-r) & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2}r & \frac{1}{2}(1-r) & 0 & \frac{1}{2}(1-r) & \frac{1}{2}r & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{2} & 0 & 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \tag{2.2}$$
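The row-by-row reasoning above can be checked numerically. The following is a minimal Python sketch, not the authors' software: it derives the backcross matrix from gamete frequencies instead of typing it in. The names `TYPES`, `gamete_freqs`, and `backcross_matrix`, and the value `r = 0.2`, are illustrative assumptions.

```python
# Ten two-locus genotypes (types 1-10), each written as a pair of haplotypes.
TYPES = [("AB", "AB"), ("AB", "Ab"), ("Ab", "Ab"), ("AB", "aB"), ("AB", "ab"),
         ("Ab", "aB"), ("Ab", "ab"), ("aB", "aB"), ("aB", "ab"), ("ab", "ab")]
# Map an unordered haplotype pair to its 0-based type index.
INDEX = {frozenset(pair): i for i, pair in enumerate(TYPES)}


def gamete_freqs(h1, h2, r):
    """Gamete frequencies of genotype h1/h2 for recombination frequency r."""
    freqs = {}
    parental = [(h1, (1 - r) / 2), (h2, (1 - r) / 2)]
    recombinant = [(h1[0] + h2[1], r / 2), (h2[0] + h1[1], r / 2)]
    for hap, f in parental + recombinant:
        freqs[hap] = freqs.get(hap, 0.0) + f
    return freqs


def backcross_matrix(parent_gamete, r):
    """Row i holds the progeny frequencies of type i+1 crossed to the parent."""
    T = [[0.0] * 10 for _ in TYPES]
    for i, (h1, h2) in enumerate(TYPES):
        for hap, f in gamete_freqs(h1, h2, r).items():
            T[i][INDEX[frozenset((hap, parent_gamete))]] += f
    return T


r = 0.2  # hypothetical recombination frequency, for illustration only
T_P1B = backcross_matrix("AB", r)  # parent P1 (AABB) contributes gamete AB only
for row in T_P1B:
    print(" ".join(f"{x:.2f}" for x in row))
```

Passing `"ab"` instead of `"AB"` gives the matrix for backcrossing to parent P2 under the same assumptions.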


Similarly, the transition matrix of backcrossing to parent P2 (aabb), i.e., $\mathbf{T}_{P2B}$,
can be determined, as given in equation 2.3. When backcrossed with parent P2
(aabb), type 1 (AABB), type 2 (AABb), type 3 (AAbb), type 4 (AaBB), type 6
(Ab/aB), and type 8 (aaBB) are all absent from the progeny, so the elements in
columns 1-4, 6, and 8 are all equal to 0. In addition, the non-zero columns 5, 7, 9,
and 10 in $\mathbf{T}_{P2B}$ are the same as the non-zero columns 1, 2, 4, and 5 in
$\mathbf{T}_{P1B}$, respectively.


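$$\mathbf{T}_{P2B} = \begin{bmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2} & 0 & 0 & 0 & \frac{1}{2} & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2}(1-r) & 0 & \frac{1}{2}r & 0 & \frac{1}{2}r & \frac{1}{2}(1-r) \\
0 & 0 & 0 & 0 & \frac{1}{2}r & 0 & \frac{1}{2}(1-r) & 0 & \frac{1}{2}(1-r) & \frac{1}{2}r \\
0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{2.3}$$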
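As a small illustration of equation 2.1 with the matrix in equation 2.3, the sketch below multiplies a probability vector by the backcross matrix once and twice. It is only an illustration: the helper names `vec_mat` and `t_p2b` and the value `r = 0.1` are assumptions made here, and the starting vector places all frequency on type 5 (AB/ab), the F1 hybrid.

```python
def vec_mat(p, T):
    """Row vector times matrix: the update p(t+1) = p(t) T of equation 2.1."""
    return [sum(p[i] * T[i][j] for i in range(len(p))) for j in range(len(T[0]))]


def t_p2b(r):
    """Backcross-to-P2 transition matrix, typed in row by row from equation 2.3."""
    h, n, c = 0.5, (1 - r) / 2, r / 2   # h = 1/2, n = (1 - r)/2, c = r/2
    return [
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],   # type 1  AABB
        [0, 0, 0, 0, h, 0, h, 0, 0, 0],   # type 2  AABb
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],   # type 3  AAbb
        [0, 0, 0, 0, h, 0, 0, 0, h, 0],   # type 4  AaBB
        [0, 0, 0, 0, n, 0, c, 0, c, n],   # type 5  AB/ab
        [0, 0, 0, 0, c, 0, n, 0, n, c],   # type 6  Ab/aB
        [0, 0, 0, 0, 0, 0, h, 0, 0, h],   # type 7  Aabb
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],   # type 8  aaBB
        [0, 0, 0, 0, 0, 0, 0, 0, h, h],   # type 9  aaBb
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # type 10 aabb
    ]


r = 0.1                                   # hypothetical recombination frequency
p0 = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]       # F1 hybrid: only type 5 (AB/ab)
T = t_p2b(r)
p1 = vec_mat(p0, T)                       # one backcross to P2
p2 = vec_mat(p1, T)                       # two backcrosses to P2
print([round(x, 4) for x in p1])
print([round(x, 4) for x in p2])
```

After one backcross only types 5, 7, 9, and 10 have non-zero frequency, which matches the non-zero columns of the matrix.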


2.1.3 Transition Matrix of One Generation of Selfing 


The transition matrix of one generation of selfing, i.e., $\mathbf{T}_S$, is given in equation 2.4.
Based on the number of heterozygous loci, three cases are considered for the 10
genotypes, i.e., no heterozygote, one-locus heterozygote, and two-locus heterozygote.

(1) No heterozygote, or homozygous at both loci. Four of the 10 genotypes belong
to this case, i.e., type 1 (AABB), type 3 (AAbb), type 8 (aaBB), and type 10
(aabb). A homozygote does not change its genotype in the progeny population
through selfing. Therefore, in transition matrix $\mathbf{T}_S$, the first
element in row 1, the third element in row 3, the eighth element in row 8, and the
tenth element in row 10 are equal to 1; the other elements in rows 1, 3, 8, and 10
are all equal to 0 (equation 2.4).

(2) One-locus heterozygote and one-locus homozygote. At the heterozygous locus,
three genotypes follow the 1:2:1 segregation ratio after selfing, i.e., they have
frequencies $\frac{1}{4}$, $\frac{1}{2}$, and $\frac{1}{4}$ in the selfed progeny. Take type 2 (AABb) as an
example. The selfed generation of AABb has type 1 (AABB), type 2 (AABb), and type 3
(AAbb), with frequencies $\frac{1}{4}$, $\frac{1}{2}$, and $\frac{1}{4}$, respectively, corresponding to the first
three elements in row 2 in transition matrix $\mathbf{T}_S$ (equation 2.4). Type 4 (AaBB),
type 7 (Aabb), and type 9 (aaBb) belong to this case as well. Genotypic segregation
in their selfed progeny is similar to that of type 2. Non-zero elements in rows 4, 7,
and 9 in transition matrix $\mathbf{T}_S$ can be determined according to the three
segregating genotypes in the selfed progeny.

(3) Double heterozygote, i.e., type 5 (AB/ab) and type 6 (Ab/aB). The 10 genotypes
are all present in their selfed progeny. Take type 5 (AB/ab) as an example to
illustrate the genotypic frequencies in the selfed progeny of double heterozygotes.
Individuals with genotype AB/ab generate four types of female and male gametes,
i.e., AB, Ab, aB, and ab. In comparison with their parental genotype AB/ab,
AB and ab are non-recombinants, with equal frequency $\frac{1}{2}(1 - r)$; Ab and aB are
recombinants, with equal frequency $\frac{1}{2}r$. When gamete-level selection is ignored,
female and male gametes generated by the AB/ab individuals have equal
frequencies in theory. The random combination between female and male
gametes generated by the same individual produces the selfed progeny of the
individual. Genotypes and their frequencies from the random combination
between female and male gametes are given in table 2.1. Diagonal items in table 2.1
give the four homozygous genotypes and their frequencies in the selfed progeny,
and therefore correspond to the first, third, eighth, and tenth elements in row 5 in
transition matrix $\mathbf{T}_S$ (equation 2.4). Type 2 can be produced in two ways, i.e.,
the combination between female gamete AB and male gamete Ab, and the combination
between female gamete Ab and male gamete AB (table 2.1). Therefore, the
frequency of type 2 in the selfed progeny is $\frac{1}{4}r(1 - r) + \frac{1}{4}r(1 - r) = \frac{1}{2}r(1 - r)$,
corresponding to the second element in row 5 in transition matrix $\mathbf{T}_S$
(equation 2.4). Similarly, frequencies of types 4-7 and 9 in the selfed progeny can be
calculated from table 2.1, and the fourth to seventh and ninth elements in row 5 in
transition matrix $\mathbf{T}_S$ can be identified (equation 2.4).


Type 6 (Ab/aB) is in a similar situation to type 5 (AB/ab). The only difference is
that the recombinant and parental types of gametes are exchanged. As far as
type 6 (Ab/aB) is concerned, gametes AB and ab become recombinants, with equal
frequency $\frac{1}{2}r$; Ab and aB become non-recombinants, with equal frequency $\frac{1}{2}(1 - r)$.
Comparing row 5 and row 6 in transition matrix $\mathbf{T}_S$, it can be seen that $1 - r$ in
row 5 becomes $r$ in row 6, and $r$ in row 5 becomes $1 - r$ in row 6 (equation 2.4).


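$$\mathbf{T}_S = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{4} & \frac{1}{2} & \frac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{4} & 0 & 0 & \frac{1}{2} & 0 & 0 & 0 & \frac{1}{4} & 0 & 0 \\
\frac{1}{4}(1-r)^2 & \frac{1}{2}r(1-r) & \frac{1}{4}r^2 & \frac{1}{2}r(1-r) & \frac{1}{2}(1-r)^2 & \frac{1}{2}r^2 & \frac{1}{2}r(1-r) & \frac{1}{4}r^2 & \frac{1}{2}r(1-r) & \frac{1}{4}(1-r)^2 \\
\frac{1}{4}r^2 & \frac{1}{2}r(1-r) & \frac{1}{4}(1-r)^2 & \frac{1}{2}r(1-r) & \frac{1}{2}r^2 & \frac{1}{2}(1-r)^2 & \frac{1}{2}r(1-r) & \frac{1}{4}(1-r)^2 & \frac{1}{2}r(1-r) & \frac{1}{4}r^2 \\
0 & 0 & \frac{1}{4} & 0 & 0 & 0 & \frac{1}{2} & 0 & 0 & \frac{1}{4} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{2.4}$$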


TAB. 2.1 — Frequencies of female and male gametes generated by double heterozygote AB/ab 
and their random combinations to produce the selfed progeny. 


Female gametes          Male gametes and frequencies
and frequencies         AB, (1-r)/2            Ab, r/2                aB, r/2                ab, (1-r)/2

AB, (1-r)/2             Type 1: AABB           Type 2: AABb           Type 4: AaBB           Type 5: AB/ab
                        (1-r)^2/4              r(1-r)/4               r(1-r)/4               (1-r)^2/4

Ab, r/2                 Type 2: AABb           Type 3: AAbb           Type 6: Ab/aB          Type 7: Aabb
                        r(1-r)/4               r^2/4                  r^2/4                  r(1-r)/4

aB, r/2                 Type 4: AaBB           Type 6: Ab/aB          Type 8: aaBB           Type 9: aaBb
                        r(1-r)/4               r^2/4                  r^2/4                  r(1-r)/4

ab, (1-r)/2             Type 5: AB/ab          Type 7: Aabb           Type 9: aaBb           Type 10: aabb
                        (1-r)^2/4              r(1-r)/4               r(1-r)/4               (1-r)^2/4
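The random union of female and male gametes summarized in table 2.1 can be sketched in a few lines of Python. This is an illustrative sketch only: `selfed_progeny`, the two gamete dictionaries, and `r = 0.2` are assumed names and values, and the ten genotypes are indexed in the order of types 1-10.

```python
# Ten two-locus genotypes (types 1-10) as unordered pairs of haplotypes.
TYPES = [("AB", "AB"), ("AB", "Ab"), ("Ab", "Ab"), ("AB", "aB"), ("AB", "ab"),
         ("Ab", "aB"), ("Ab", "ab"), ("aB", "aB"), ("aB", "ab"), ("ab", "ab")]
INDEX = {frozenset(pair): i for i, pair in enumerate(TYPES)}


def selfed_progeny(gametes):
    """Combine female and male gametes at random; return the 10 type frequencies."""
    row = [0.0] * 10
    for female, f in gametes.items():
        for male, m in gametes.items():
            # Each cell of table 2.1 contributes the product of two gamete frequencies.
            row[INDEX[frozenset((female, male))]] += f * m
    return row


r = 0.2  # hypothetical recombination frequency
coupling = {"AB": (1 - r) / 2, "Ab": r / 2, "aB": r / 2, "ab": (1 - r) / 2}   # AB/ab
repulsion = {"AB": r / 2, "Ab": (1 - r) / 2, "aB": (1 - r) / 2, "ab": r / 2}  # Ab/aB

row5 = selfed_progeny(coupling)    # selfed progeny of type 5
row6 = selfed_progeny(repulsion)   # selfed progeny of type 6
print([round(x, 4) for x in row5])
print([round(x, 4) for x in row6])
```

Summing the two off-diagonal cells that give the same genotype is what doubles the frequency of each heterozygous type relative to a single cell of the table.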


2.1.4 Transition Matrix of Doubled Haploid


The transition matrix of doubled haploid (DH), represented by $\mathbf{T}_D$, is given in
equation 2.5. It is much simpler, since the frequencies of diploid genotypes in the
DH population are equal to the frequencies of their haploid gametes. Similar to selfing,
three cases are considered based on the number of heterozygous loci.

(1) No heterozygote, or homozygous at both loci. Four of the 10 genotypes belong
to this case, i.e., type 1 (AABB), type 3 (AAbb), type 8 (aaBB), and type 10
(aabb). A homozygote generates only one type of gamete, and the doubled haploid of
this gamete has exactly the same genotype as the homozygous parent.
Therefore, in transition matrix $\mathbf{T}_D$, the first element in row 1, the third element
in row 3, the eighth element in row 8, and the tenth element in row 10 are equal
to 1; the other elements in rows 1, 3, 8, and 10 are all equal to 0 (equation 2.5).

(2) One-locus heterozygote and one-locus homozygote. At the heterozygous locus,
two types of gametes are generated, with equal frequency. For example, for
type 2 (AABb), the two types of gametes are AB and Ab, and their doubled haploid
genotypes become type 1 (AABB) and type 3 (AAbb) in the DH progeny with equal
frequency $\frac{1}{2}$, corresponding to the first and third elements in row 2 in transition
matrix $\mathbf{T}_D$ (equation 2.5). The other elements in row 2 are equal to 0. Type 4
(AaBB), type 7 (Aabb), and type 9 (aaBb) have probability vectors similar to that of
type 2.

(3) Double heterozygote, i.e., type 5 (AB/ab) and type 6 (Ab/aB). The 4
homozygous genotypes are all present in their DH progeny. Take type 5
(AB/ab) as an example to illustrate the genotypic frequencies in the DH progeny of
double heterozygotes. Individuals with genotype AB/ab generate four types of
female and male gametes, i.e., AB, Ab, aB, and ab. In comparison with their
parental genotype AB/ab, AB and ab are non-recombinants, with equal
frequency $\frac{1}{2}(1 - r)$; Ab and aB are recombinants, with equal frequency $\frac{1}{2}r$.
Their doubled haploid genotypes become type 1 (AABB), type 3 (AAbb), type 8
(aaBB), and type 10 (aabb) in the DH progeny with frequencies $\frac{1}{2}(1 - r)$, $\frac{1}{2}r$, $\frac{1}{2}r$,
and $\frac{1}{2}(1 - r)$, respectively, corresponding to the first, third, eighth, and tenth
elements in row 5 in transition matrix $\mathbf{T}_D$ (equation 2.5).


Type 6 (Ab/aB) is in a similar situation to type 5 (AB/ab). The only difference is
that the recombinant and parental types of gametes are exchanged. As far as type 6
(Ab/aB) is concerned, gametes AB and ab become recombinants, with equal frequency
$\frac{1}{2}r$; Ab and aB become non-recombinants, with equal frequency $\frac{1}{2}(1 - r)$. Comparing
row 5 and row 6 in transition matrix $\mathbf{T}_D$, it can be seen that $1 - r$ in row 5 becomes
$r$ in row 6, and $r$ in row 5 becomes $1 - r$ in row 6 (equation 2.5).


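$$\mathbf{T}_D = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & 0 \\
\frac{1}{2}(1-r) & 0 & \frac{1}{2}r & 0 & 0 & 0 & 0 & \frac{1}{2}r & 0 & \frac{1}{2}(1-r) \\
\frac{1}{2}r & 0 & \frac{1}{2}(1-r) & 0 & 0 & 0 & 0 & \frac{1}{2}(1-r) & 0 & \frac{1}{2}r \\
0 & 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{2.5}$$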
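Because each DH line is a doubled gamete, the DH matrix can be assembled directly from gamete frequencies. The Python sketch below illustrates this under assumed names (`gamete_freqs`, `dh_matrix`, `HOMOZYGOTE_COLUMN`) and an arbitrary `r = 0.2`; it is not taken from the book's software.

```python
# Ten two-locus genotypes (types 1-10) as pairs of haplotypes.
TYPES = [("AB", "AB"), ("AB", "Ab"), ("Ab", "Ab"), ("AB", "aB"), ("AB", "ab"),
         ("Ab", "aB"), ("Ab", "ab"), ("aB", "aB"), ("aB", "ab"), ("ab", "ab")]
# Doubling a gamete gives a homozygote: 0-based columns of types 1, 3, 8, and 10.
HOMOZYGOTE_COLUMN = {"AB": 0, "Ab": 2, "aB": 7, "ab": 9}


def gamete_freqs(h1, h2, r):
    """Gamete frequencies of genotype h1/h2 for recombination frequency r."""
    freqs = {}
    for hap, f in [(h1, (1 - r) / 2), (h2, (1 - r) / 2),
                   (h1[0] + h2[1], r / 2), (h2[0] + h1[1], r / 2)]:
        freqs[hap] = freqs.get(hap, 0.0) + f
    return freqs


def dh_matrix(r):
    """Each row copies the gamete frequencies into the homozygote columns."""
    T = [[0.0] * 10 for _ in TYPES]
    for i, (h1, h2) in enumerate(TYPES):
        for hap, f in gamete_freqs(h1, h2, r).items():
            T[i][HOMOZYGOTE_COLUMN[hap]] += f
    return T


T_D = dh_matrix(0.2)  # hypothetical r = 0.2
print([round(x, 2) for x in T_D[4]])  # row 5: DH progeny of AB/ab
print([round(x, 2) for x in T_D[5]])  # row 6: DH progeny of Ab/aB
```

Row 5 of the result gives the genotypic frequencies of a DH population derived from the F1 hybrid AB/ab, since that hybrid is the only genotype in the starting generation.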


2.1.5 Transition Matrix of Repeated Selfing 


Repeated selfing is commonly used to develop pure inbred lines in plants. Repeated
selfing can begin with some controlled crosses, such as single crosses between two
parents and double crosses between four parents. Sometimes it can also begin with
random mating populations, such as a naturally pollinated maize population in the
field. The frequency of heterozygotes at each locus is reduced by half after one
generation of selfing. After a number of generations of selfing, only homozygotes,
i.e., AABB, AAbb, aaBB, and aabb, remain in the progeny population. The
remaining homozygotes are the same as those produced by doubled haploids
but have different frequencies. The overall frequency of the four homozygotes
becomes 1 immediately after one generation of doubled haploid, while the
homozygous frequency approaches 1 gradually during repeated selfing.
Crossing-overs can accumulate during repeated selfing until the homozygotes have a
total frequency of 1 in the progeny population. Therefore, more crossing-over events
can be observed in the repeated selfed progeny than in the DH
population. The transition matrix of repeated selfing is represented by $\mathbf{T}_R$ and given
in equation 2.6. Similar to one generation of selfing and doubled haploid, three cases
are considered based on the number of heterozygous loci.


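$$\mathbf{T}_R = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & 0 \\
\frac{1}{2}(1-R) & 0 & \frac{1}{2}R & 0 & 0 & 0 & 0 & \frac{1}{2}R & 0 & \frac{1}{2}(1-R) \\
\frac{1}{2}R & 0 & \frac{1}{2}(1-R) & 0 & 0 & 0 & 0 & \frac{1}{2}(1-R) & 0 & \frac{1}{2}R \\
0 & 0 & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \frac{1}{2} & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{2.6}$$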


(1) No heterozygote, or homozygous at both loci. Four of the 10 genotypes belong
to this case, i.e., type 1 (AABB), type 3 (AAbb), type 8 (aaBB), and type 10
(aabb). A homozygote does not change its genotype in the progeny population
through repeated selfing. Therefore, in transition matrix $\mathbf{T}_R$, the
first element in row 1, the third element in row 3, the eighth element in row 8,
and the tenth element in row 10 are equal to 1; the other elements in rows 1, 3, 8,
and 10 are all equal to 0 (equation 2.6).

(2) One-locus heterozygote and one-locus homozygote. At the heterozygous locus,
the two homozygotes follow the 1:1 segregation ratio after repeated selfing. Take
type 2 (AABb) as an example. The repeated selfed progeny of AABb have type 1
(AABB) and type 3 (AAbb), with equal frequency $\frac{1}{2}$, corresponding to the first
and third elements in row 2 in transition matrix $\mathbf{T}_R$ (equation 2.6). Type 4
(AaBB), type 7 (Aabb), and type 9 (aaBb) belong to this case as well. Non-zero
elements in rows 4, 7, and 9 in transition matrix $\mathbf{T}_R$ can be determined
according to their two homozygous genotypes in the repeated selfed progeny.

(3) Double heterozygote, i.e., type 5 (AB/ab) and type 6 (Ab/aB). A double
heterozygote generates four homozygotes through repeated selfing. Two of
them are recombinant types with an accumulated frequency of $R$, and the
other two are non-recombinant types with an accumulated frequency of
$1 - R$. Since the two alleles at each locus have equal frequency in double
heterozygotes, the two recombinant homozygotes have equal frequency, i.e., $\frac{1}{2}R$;
the two non-recombinant homozygotes also have equal frequency, i.e., $\frac{1}{2}(1 - R)$.
As far as type 5 (AB/ab) is concerned, homozygotes AABB and aabb are
non-recombinants, with equal frequency $\frac{1}{2}(1 - R)$; homozygotes AAbb and
aaBB are recombinants, with equal frequency $\frac{1}{2}R$. The four homozygotes AABB,
AAbb, aaBB, and aabb have frequencies $\frac{1}{2}(1 - R)$, $\frac{1}{2}R$, $\frac{1}{2}R$, and $\frac{1}{2}(1 - R)$,
respectively, in the repeated selfing progeny, corresponding to the first, third,
eighth, and tenth elements in row 5 in transition matrix $\mathbf{T}_R$ (equation 2.6).


As far as type 6 (Ab/aB) is concerned, homozygotes AABB and aabb become
recombinants, with equal frequency $\frac{1}{2}R$; homozygotes AAbb and aaBB become
non-recombinants, with equal frequency $\frac{1}{2}(1 - R)$. The four homozygotes AABB, AAbb,
aaBB, and aabb have frequencies $\frac{1}{2}R$, $\frac{1}{2}(1 - R)$, $\frac{1}{2}(1 - R)$, and $\frac{1}{2}R$, respectively, in the
repeated selfing progeny, corresponding to the first, third, eighth, and tenth
elements in row 6 in transition matrix $\mathbf{T}_R$ (equation 2.6).

As far as the genotype of F1 is concerned, i.e., AB/ab, $R$ used in equation 2.6 is
equal to the total frequency of the two recombinant homozygotes AAbb and aaBB in the
progeny population after many generations of selfing. By using matrix and Markov
chain theories, the relationship between the accumulated ($R$) and one-meiosis ($r$)
recombination frequencies can be proved (see exercise 2.8 for details), as given in
equation 2.7. It can be seen from equation 2.7 that the accumulated frequency $R$ is
about two times the one-meiosis frequency when $r$ is small. In other words, repeated
selfing can expand the recombination frequency between closely linked markers, and
therefore increase the resolution of genetic linkage maps.

$$R = \frac{2r}{1 + 2r} \quad \text{or} \quad r = \frac{R}{2(1 - R)} \tag{2.7}$$
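Equation 2.7 can be checked numerically by applying one generation of selfing many times to the F1 hybrid and summing the frequencies of the two recombinant homozygotes. The Python sketch below does this; the function names and the choice of 60 generations are assumptions made for illustration, and the check is not a substitute for the proof referred to in exercise 2.8.

```python
# Ten two-locus genotypes (types 1-10) as unordered pairs of haplotypes.
TYPES = [("AB", "AB"), ("AB", "Ab"), ("Ab", "Ab"), ("AB", "aB"), ("AB", "ab"),
         ("Ab", "aB"), ("Ab", "ab"), ("aB", "aB"), ("aB", "ab"), ("ab", "ab")]
INDEX = {frozenset(pair): i for i, pair in enumerate(TYPES)}


def gamete_freqs(h1, h2, r):
    """Gamete frequencies of genotype h1/h2 for recombination frequency r."""
    freqs = {}
    for hap, f in [(h1, (1 - r) / 2), (h2, (1 - r) / 2),
                   (h1[0] + h2[1], r / 2), (h2[0] + h1[1], r / 2)]:
        freqs[hap] = freqs.get(hap, 0.0) + f
    return freqs


def selfing_matrix(r):
    """One generation of selfing: random union of each genotype's own gametes."""
    T = [[0.0] * 10 for _ in TYPES]
    for i, (h1, h2) in enumerate(TYPES):
        g = gamete_freqs(h1, h2, r)
        for female, f in g.items():
            for male, m in g.items():
                T[i][INDEX[frozenset((female, male))]] += f * m
    return T


def accumulated_R(r, generations=60):
    """Frequency of AAbb plus aaBB after repeated selfing of the F1 (AB/ab)."""
    p = [0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0]
    T = selfing_matrix(r)
    for _ in range(generations):
        p = [sum(p[i] * T[i][j] for i in range(10)) for j in range(10)]
    return p[2] + p[7]


for r in (0.01, 0.05, 0.2, 0.5):
    print(r, round(accumulated_R(r), 6), round(2 * r / (1 + 2 * r), 6))
```

The two printed columns should agree to the displayed precision, because the remaining heterozygosity is negligible after this many generations.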


2.1.6 Expression of the Two-Locus Genotypic Frequencies 
in Matrix Format 


Let the F1 hybrid between parents P1 (AABB) and P2 (aabb) be generation 0. Type
5 (AB/ab) is the only genotype in generation 0 and therefore has frequency 1. The
probability vector of generation 0 is given in equation 2.8.

$$\mathbf{p}^{(0)} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \tag{2.8}$$


Based on the transition matrices given in equations 2.2-2.6 for the five
mating systems, two-point theoretical genotypic frequencies in most bi-parental
populations can be calculated. Table 2.2 gives the expressions in terms of
$\mathbf{p}^{(0)}$ and the previously defined matrices for 20 bi-parental populations. For
example, P1BC1F1 is generated by backcrossing the F1 hybrid with parent P1, so the
genotypic frequencies are given by $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B}$; F2 is generated by one
generation of selfing from the F1 hybrid, so the genotypic frequencies are given by
$\mathbf{p}^{(0)} \times \mathbf{T}_S$. If there are two generations of backcrossing or selfing, the
transition matrix has to be multiplied twice, such as in F3, P1BC2F1, and P2BC2F1.
For some populations, two different mating systems may be used, such as P1BC1RIL
and P2BC1F2.

TAB. 2.2 — Two-point theoretical genotypic frequencies expressed by generation transition
matrices in 20 bi-parental populations.


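Population number   Population name   Expression on theoretical genotypic frequencies
 1                  P1BC1F1           $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B}$
 2                  P2BC1F1           $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B}$
 3                  F1DH              $\mathbf{p}^{(0)} \times \mathbf{T}_D$
 4                  F1RIL             $\mathbf{p}^{(0)} \times \mathbf{T}_R$
 5                  P1BC1RIL          $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_R$
 6                  P2BC1RIL          $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_R$
 7                  F2                $\mathbf{p}^{(0)} \times \mathbf{T}_S$
 8                  F3                $\mathbf{p}^{(0)} \times \mathbf{T}_S \times \mathbf{T}_S$
 9                  P1BC2F1           $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_{P1B}$
10                  P2BC2F1           $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_{P2B}$
11                  P1BC2RIL          $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_{P1B} \times \mathbf{T}_R$
12                  P2BC2RIL          $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_{P2B} \times \mathbf{T}_R$
13                  P1BC1F2           $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_S$
14                  P2BC1F2           $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_S$
15                  P1BC2F2           $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_{P1B} \times \mathbf{T}_S$
16                  P2BC2F2           $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_{P2B} \times \mathbf{T}_S$
17                  P1BC1DH           $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_D$
18                  P2BC1DH           $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_D$
19                  P1BC2DH           $\mathbf{p}^{(0)} \times \mathbf{T}_{P1B} \times \mathbf{T}_{P1B} \times \mathbf{T}_D$
20                  P2BC2DH           $\mathbf{p}^{(0)} \times \mathbf{T}_{P2B} \times \mathbf{T}_{P2B} \times \mathbf{T}_D$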
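The expressions in table 2.2 amount to chained vector-matrix products. The Python sketch below evaluates a few of them; the mating codes `"P1B"`, `"P2B"`, `"S"`, `"D"`, and `"R"`, the helper names, and `r = 0.2` are all assumptions introduced here. The repeated-selfing step reuses the DH construction with $R$ from equation 2.7 in place of $r$, which reproduces the layout of equation 2.6.

```python
# Ten two-locus genotypes (types 1-10) as unordered pairs of haplotypes.
TYPES = [("AB", "AB"), ("AB", "Ab"), ("Ab", "Ab"), ("AB", "aB"), ("AB", "ab"),
         ("Ab", "aB"), ("Ab", "ab"), ("aB", "aB"), ("aB", "ab"), ("ab", "ab")]
INDEX = {frozenset(pair): i for i, pair in enumerate(TYPES)}
P0 = [0, 0, 0, 0, 1.0, 0, 0, 0, 0, 0]  # generation 0: the F1 hybrid AB/ab


def gamete_freqs(h1, h2, r):
    """Gamete frequencies of genotype h1/h2 for recombination frequency r."""
    freqs = {}
    for hap, f in [(h1, (1 - r) / 2), (h2, (1 - r) / 2),
                   (h1[0] + h2[1], r / 2), (h2[0] + h1[1], r / 2)]:
        freqs[hap] = freqs.get(hap, 0.0) + f
    return freqs


def transition(mating, r):
    """Transition matrix for mating code 'P1B', 'P2B', 'S', or 'D'."""
    T = [[0.0] * 10 for _ in TYPES]
    for i, (h1, h2) in enumerate(TYPES):
        g = gamete_freqs(h1, h2, r)
        for hap, f in g.items():
            if mating == "S":      # selfing: random union of two own gametes
                for hap2, f2 in g.items():
                    T[i][INDEX[frozenset((hap, hap2))]] += f * f2
            else:                  # backcross gamete, or the doubled gamete for DH
                mate = {"P1B": "AB", "P2B": "ab", "D": hap}[mating]
                T[i][INDEX[frozenset((hap, mate))]] += f
    return T


def frequencies(r, matings):
    """Apply the listed mating systems to the F1 hybrid, left to right."""
    R = 2 * r / (1 + 2 * r)        # accumulated frequency, equation 2.7
    p = P0
    for m in matings:
        T = transition("D", R) if m == "R" else transition(m, r)
        p = [sum(p[i] * T[i][j] for i in range(10)) for j in range(10)]
    return p


r = 0.2  # hypothetical recombination frequency
for name, matings in [("P1BC1F1", ["P1B"]), ("F2", ["S"]), ("F3", ["S", "S"]),
                      ("P1BC1RIL", ["P1B", "R"]), ("P2BC2DH", ["P2B", "P2B", "D"])]:
    print(name, [round(x, 4) for x in frequencies(r, matings)])
```

Each printed vector sums to 1, and the same function covers the remaining rows of the table by changing the list of mating codes.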


From the expressions given in table 2.2, theoretical genotypic frequencies can be
found and then used to estimate the unknown recombination frequency, which is
the major content of the next two sections of this chapter. Here it may be worthwhile
mentioning that elements in the probability vector can also be arranged by
column, which is equivalent to the transpose of the row vector, such as equation 2.8.
If this is the case, all generation transition matrices previously defined in
equations 2.2-2.6 should be transposed as well, and so should all expressions given
in table 2.2.


2.2 Theoretical Genotypic Frequencies at Two Loci 


2.2.1 Theoretical Frequencies of 10 Genotypes at Two Loci 


Two-point theoretical genotypic frequencies are given in table 2.3 for the 20 bi-parental
populations shown in figure 1.1. These frequencies provide the theoretical basis for
estimating the two-point recombination frequency (Sun et al., 2012; Nelson, 2011).
Ten genotypes are considered in the theoretical deduction and in table 2.3. However, as
said earlier, the two types of double heterozygotes cannot be distinguished without
additional information. In addition, when the two alleles at one locus are dominant and
recessive, more genotypes may become indistinguishable, especially in temporary
populations. In practical populations, sample sizes can be counted only for
identifiable genotypes. In this situation, the indistinguishable types have to be merged
to have one theoretical frequency. This issue will be discussed next in detail.

TAB. 2.3 — Theoretical frequencies of 10 genotypes at two loci in 20 bi-parental populations.

Populations (table rows): P1BC1F1, P2BC1F1, F1DH, F1RIL, P1BC1RIL, P2BC1RIL, F2, F3,
P1BC2F1, P2BC2F1, P1BC2RIL, P2BC2RIL, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2, P1BC1DH,
P2BC1DH, P1BC2DH, P2BC2DH


2.2.2 Theoretical Frequencies of 4 Homozygotes 
in Permanent Populations 


To be clear, frequencies of the four homozygotes in 10 permanent populations are first
given in table 2.4. It can be seen that F1DH has the simplest expressions for the theoretical
frequencies. The proportion of the two recombinant homozygotes AAbb and aaBB is
exactly equal to r; the proportion of the two non-recombinant homozygotes AABB and
aabb is exactly equal to 1 - r. Therefore, the observed proportion of homozygotes
AAbb and aaBB is the estimate of recombination frequency in the DH population
coming from the doubling of female or male gametes produced by the F1 hybrid. Due
to its simplicity, F1DH has been frequently used as an example to illustrate many
genetic analysis methods, including gene mapping on quantitative traits.

In F1RIL, the proportion of the recombinant homozygotes is equal to R, and the
proportion of the non-recombinant homozygotes is equal to 1 - R. Therefore, the
accumulated frequency R can first be estimated by the observed proportion of
homozygotes AAbb and aaBB in F1RIL, and then the one-meiosis frequency r can be
estimated by equation 2.7.
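
As a rough illustration of this two-step estimate, the sketch below assumes that equation 2.7 is the usual relationship for repeated selfing, R = 2r/(1 + 2r), so that r = R/[2(1 - R)]. The function names and the genotype counts are hypothetical and serve only as an example.

```python
def accumulated_from_single(r):
    """Accumulated frequency R under repeated selfing, assuming R = 2r / (1 + 2r)."""
    return 2.0 * r / (1.0 + 2.0 * r)


def single_from_accumulated(R):
    """Invert R = 2r / (1 + 2r) to recover the one-meiosis frequency r."""
    return R / (2.0 * (1.0 - R))


def estimate_r_in_ril(n_AABB, n_AAbb, n_aaBB, n_aabb):
    """Estimate R as the proportion of recombinant homozygotes, then convert to r."""
    n = n_AABB + n_AAbb + n_aaBB + n_aabb
    R_hat = (n_AAbb + n_aaBB) / n
    return R_hat, single_from_accumulated(R_hat)


# Hypothetical counts of AABB, AAbb, aaBB, aabb in an F1RIL population
R_hat, r_hat = estimate_r_in_ril(80, 20, 20, 80)
print(R_hat, r_hat)
```

The conversion is only meaningful for R below 0.5, which corresponds to r below 0.5 under the assumed relationship.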

Expressions for the homozygous frequencies are less simple in other permanent
populations. However, one common term can still be identified among the four
homozygous frequencies, i.e., (1 - r)(1 - R) in P1BC1RIL and P2BC1RIL,
(1 - r)^2 (1 - R) in P1BC2RIL and P2BC2RIL, (1 - r)^2 in P1BC1DH and P2BC1DH,
and (1 - r)^3 in P1BC2DH and P2BC2DH (table 2.4). In these populations, the
common term can be estimated first from the observed numbers of the four homozygotes,
which avoids the use of more complicated iterative algorithms.


2.2.3 Genotypic Frequencies of Two Co-Dominant Loci 
in Temporary Populations 


For two co-dominant loci, each locus has three identifiable genotypes, resulting in
nine identifiable genotypes when the two loci are considered together. As discussed
in §2.1, the two types of double heterozygotes given in table 2.3 have to be combined
into one identifiable genotype, i.e., AaBb. In a practical population, only the
number of individuals having genotype AaBb can be counted, rather than AB/ab
and Ab/aB separately. Therefore, the expected frequency of genotype AaBb is
needed when the sample size of AaBb is used in estimating the recombination frequency.
By summing up the frequencies of the two double heterozygotes given in table 2.3,
frequencies of the nine identifiable genotypes are obtained and given in table 2.5 for 10
temporary bi-parental populations. Based on the theoretical frequencies given in
table 2.5, the likelihood function can be constructed from the observed numbers of
identifiable genotypes, and then the recombination frequency can be estimated (see
§2.3 for details).


TAB. 2.4 – Frequencies of homozygous genotypes in permanent populations.

Rows: F1DH, F1RIL, P1BC1RIL, P2BC1RIL, P1BC2RIL, P2BC2RIL, P1BC1DH, P2BC1DH, P1BC2DH, P2BC2DH.

Notes: r is the one-meiosis recombination frequency; R is the accumulated frequency during repeated selfing (see equation 2.7).


TAB. 2.5 – Frequencies of nine identifiable genotypes in temporary populations when alleles A and a are co-dominant, and alleles B and b are co-dominant.

Columns: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb.
Rows: P1BC1F1, P2BC1F1, F2, F3, P1BC2F1, P2BC2F1, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2.

Notes: a blank indicates that the corresponding genotype is absent from the population, and its theoretical frequency is 0; r is the one-meiosis recombination frequency.

Obviously, one generation of backcrossing with either parent gives the same
theoretical frequencies (table 2.5) as F1DH (table 2.4), even though the four
genotypes are different in these populations. In fact, genotypes in F1DH, P1BC1F1, and
P2BC1F1 can all represent the four gametes produced by the F1 hybrid. Therefore,
when both loci are co-dominant, the three populations are equivalent in estimating
the recombination frequency.


2.2.4 Genotypic Frequencies of One Co-Dominant Locus 
and One Dominant Locus in Temporary Populations 


When alleles A and a are co-dominant, and allele B is dominant to allele b, BB and
Bb become indistinguishable at the dominant locus. When both loci are considered
together, there are six distinguishable types, i.e., (1) AAB_ (incl. AABB and
AABb), (2) AAbb, (3) AaB_ (incl. AaBB and AaBb), (4) Aabb, (5) aaB_ (incl.
aaBB and aaBb), and (6) aabb. In table 2.5, the sum of the two frequencies of
AABB and AABb gives the theoretical frequency of mixed type 1 (AAB_);
the sum of the two frequencies of AaBB and AaBb gives the theoretical frequency
of mixed type 3 (AaB_); the sum of the two frequencies of aaBB and aaBb
gives the theoretical frequency of mixed type 5 (aaB_). Theoretical frequencies of
the six distinguishable types are given in table 2.6.
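
The merging of indistinguishable genotypes can be sketched in a few lines of Python. The example assumes the standard F2 frequencies of the nine co-dominant genotypes (these should agree with the F2 row of table 2.5); the function names are illustrative only and are not taken from any published software.

```python
def f2_codominant_frequencies(r):
    """Standard F2 frequencies of the nine genotypes at two co-dominant loci."""
    s = 1.0 - r
    return {
        "AABB": 0.25 * s * s, "AABb": 0.5 * r * s, "AAbb": 0.25 * r * r,
        "AaBB": 0.5 * r * s, "AaBb": 0.5 * (s * s + r * r), "Aabb": 0.5 * r * s,
        "aaBB": 0.25 * r * r, "aaBb": 0.5 * r * s, "aabb": 0.25 * s * s,
    }


def merge_classes(freqs, b_dominant=True):
    """Merge genotypes that cannot be told apart at locus B (locus A co-dominant).

    b_dominant=True : BB and Bb are merged into B_ (B dominant to b).
    b_dominant=False: Bb and bb are merged into _b (B recessive to b).
    """
    merged = {}
    for genotype, p in freqs.items():
        a_part, b_part = genotype[:2], genotype[2:]
        if b_dominant:
            b_label = "B_" if b_part in ("BB", "Bb") else "bb"
        else:
            b_label = "BB" if b_part == "BB" else "_b"
        key = a_part + b_label
        merged[key] = merged.get(key, 0.0) + p
    return merged


six_types = merge_classes(f2_codominant_frequencies(0.1), b_dominant=True)
for name, p in six_types.items():
    print(name, round(p, 4))
```

The same function with `b_dominant=False` gives the six types used when allele B is recessive to allele b.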

It can be seen from table 2.6 that two distinguishable types, i.e., AAB_ and AaB_,
have frequencies 1/2 and 1/2 in P1BC1F1, and frequencies 3/4 and 1/4 in P1BC2F1. The other
types are absent in P1BC1F1 and P1BC2F1. Recombination frequency is not included
in the theoretical frequencies, and therefore cannot be estimated between one
co-dominant locus and one dominant locus in backcrossed populations with parent
P1. The four non-zero frequencies in P2BC1F1 (table 2.6) are the same as in F1DH
(table 2.4). Therefore, P2BC1F1 and F1DH are equivalent when one locus is
co-dominant and one is dominant.


2.2.5 Genotypic Frequencies of One Co-Dominant Locus 
and One Recessive Locus in Temporary Populations 


When alleles A and a are co-dominant, and allele B is recessive to allele b, Bb and bb
become indistinguishable at the recessive locus. When both loci are considered
together, there are six distinguishable types, i.e., (1) AABB, (2) AA_b (incl. AABb
and AAbb), (3) AaBB, (4) Aa_b (incl. AaBb and Aabb), (5) aaBB, and
(6) aa_b (incl. aaBb and aabb). In table 2.5, the sum of the two frequencies of AABb
and AAbb gives the theoretical frequency of mixed type 2 (AA_b); the sum
of the two frequencies of AaBb and Aabb gives the theoretical frequency of
mixed type 4 (Aa_b); the sum of the two frequencies of aaBb and aabb gives the
theoretical frequency of mixed type 6 (aa_b). Theoretical frequencies of the six
distinguishable types are given in table 2.7.


TAB. 2.6 – Frequencies of six identifiable genotypes in temporary populations when alleles A and a are co-dominant, and allele B is dominant to allele b.

Columns: AAB_ (or AABB + AABb), AAbb, AaB_ (or AaBB + AaBb), Aabb, aaB_ (or aaBB + aaBb), aabb.
Rows: P1BC1F1, P2BC1F1, F2, F3, P1BC2F1, P2BC2F1, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2.

Notes: a blank indicates that the corresponding genotype is absent from the population, and its theoretical frequency is 0; r is the one-meiosis recombination frequency.


TAB. 2.7 – Frequencies of six identifiable genotypes in temporary populations when alleles A and a are co-dominant, and allele B is recessive to allele b.

Columns: AABB, AA_b (or AABb + AAbb), AaBB, Aa_b (or AaBb + Aabb), aaBB, aa_b (or aaBb + aabb).
Rows: P1BC1F1, P2BC1F1, F2, F3, P1BC2F1, P2BC2F1, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2.

Notes: a blank indicates that the corresponding genotype is absent from the population, and its theoretical frequency is 0; r is the one-meiosis recombination frequency.
It can be seen from table 2.7 that two distinguishable types, i.e., Aa_b and aa_b,
have frequencies 1/2 and 1/2 in P2BC1F1, and frequencies 1/4 and 3/4 in P2BC2F1. The other
types are absent in P2BC1F1 and P2BC2F1. Recombination frequency between one
co-dominant locus and one recessive locus cannot be estimated in backcrossed
populations with parent P2. The four non-zero frequencies in P1BC1F1 (table 2.7)
are the same as in F1DH (table 2.4). Therefore, P1BC1F1 and F1DH are equivalent
when one locus is co-dominant and one is recessive.


2.2.6 Genotypic Frequencies of Two Dominant Loci 
in Temporary Populations 


When allele A is dominant to allele a, and allele B is dominant to allele b, AA and Aa
are indistinguishable, and so are BB and Bb. When both loci are considered
together, there are four distinguishable types, i.e., (1) A_B_ (incl. AABB, AABb,
AaBB and AaBb), (2) A_bb (incl. AAbb and Aabb), (3) aaB_ (incl. aaBB and
aaBb), and (4) aabb. In table 2.5, the sum of the four frequencies of AABB,
AABb, AaBB, and AaBb gives the theoretical frequency of mixed type 1
(A_B_); the sum of the two frequencies of AAbb and Aabb gives the theoretical
frequency of mixed type 2 (A_bb); the sum of the two frequencies of aaBB and aaBb
gives the theoretical frequency of mixed type 3 (aaB_). Theoretical frequencies
of the four distinguishable types are given in table 2.8.

Type A_B_ is the only one in P1BC1F1 and P1BC2F1, indicating that the recombination
frequency between two dominant loci cannot be estimated in backcrossed
populations with parent P1. The four non-zero frequencies in P2BC1F1 (table 2.8) are the same
as in F1DH (table 2.4). Therefore, P2BC1F1 and F1DH are equivalent in estimating the
recombination frequency between two dominant loci.


2.2.7 Genotypic Frequencies of One Dominant Locus 
and One Recessive Locus in Temporary Populations 


When allele A is dominant to allele a, but allele B is recessive to allele b, AA and Aa
are indistinguishable at the dominant locus, and Bb and bb are indistinguishable
at the recessive locus. When both loci are considered together, there are four
distinguishable types, i.e., (1) A_BB (incl. AABB and AaBB), (2) A__b (incl. AABb,
AAbb, AaBb, and Aabb), (3) aaBB, and (4) aa_b (incl. aaBb and aabb). In table 2.5, the
sum of the two frequencies of AABB and AaBB gives the theoretical
frequency of mixed type 1 (A_BB); the sum of the four frequencies of AABb, AAbb,
AaBb, and Aabb gives the theoretical frequency of mixed type 2 (A__b);
the sum of the two frequencies of aaBb and aabb gives the theoretical frequency of
mixed type 4 (aa_b). Theoretical frequencies of the four distinguishable types are
given in table 2.9.

It can be seen from table 2.9 that recombination frequency is not included in the
theoretical genotypic frequencies of any backcross population. Therefore, the
recombination frequency between one dominant locus and one recessive locus cannot be
estimated in P1BC1F1, P1BC2F1, P2BC1F1, and P2BC2F1.


TAB. 2.8 – Frequencies of four identifiable genotypes in temporary populations when allele A is dominant to allele a, and allele B is dominant to allele b.

Columns: A_B_ (or AABB + AABb + AaBB + AaBb), A_bb (or AAbb + Aabb), aaB_ (or aaBB + aaBb), aabb.
Rows: P1BC1F1, P2BC1F1, F2, F3, P1BC2F1, P2BC2F1, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2.

Notes: a blank indicates that the corresponding genotype is absent from the population, and its theoretical frequency is 0; r is the one-meiosis recombination frequency.


TAB. 2.9 – Frequencies of four identifiable genotypes in temporary populations when allele A is dominant to allele a, and allele B is recessive to allele b.

Columns: A_BB (or AABB + AaBB), A__b (or AABb + AAbb + AaBb + Aabb), aaBB, aa_b (or aaBb + aabb).
Rows: P1BC1F1, P2BC1F1, F2, F3, P1BC2F1, P2BC2F1, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2.

Notes: a blank indicates that the corresponding genotype is absent from the population, and its theoretical frequency is 0; r is the one-meiosis recombination frequency.
2.2.8 Genotypic Frequencies of Two Recessive Loci
in Temporary Populations


When allele A is recessive to allele a, and allele B is recessive to allele b, Aa and aa
are indistinguishable, and so are Bb and bb. When both loci are considered
together, there are four distinguishable types, i.e., (1) AABB, (2) AA_b (incl. AABb
and AAbb), (3) _aBB (incl. AaBB and aaBB), and (4) _a_b (incl. AaBb, Aabb,
aaBb and aabb). Theoretical frequencies of the four distinguishable types can be
obtained by summing up the included genotypes, and are given in table 2.10.

Type _a_b is the only one in P2BC1F1 and P2BC2F1, indicating that the recombination
frequency between two recessive loci cannot be estimated in backcrossed populations
with parent P2. The four non-zero frequencies in P1BC1F1 (table 2.10) are the same
as in F1DH (table 2.4). Therefore, P1BC1F1 and F1DH are equivalent in estimating the
recombination frequency between two recessive loci.


2.3 Estimation of Two-Point Recombination Frequency 


2.3.1 Maximum Likelihood Estimation of Recombination
Frequency in DH Populations


As seen previously from the genotypic frequencies in 20 bi-parental populations,
the DH population is developed by directly doubling the haploid female or
male gametes produced by the F1 hybrid, and therefore has the simplest population
structure. Structure in population genetics mainly refers to the allelic and genotypic
frequencies at one locus, and to the joint genotypic frequencies when two or more loci
are considered simultaneously, as in the two-point linkage analysis
shown in §2.2. The DH population, being the simplest, is taken here as an example to
illustrate the principle of maximum likelihood in estimating recombination frequency.

Assume two parents P1 and P2 have genotypes AABB and aabb at two marker
loci with recombination frequency r, which is to be estimated. Their F1 hybrid has
genotype AB/ab. During meiosis, the F1 hybrid generates four types of haploid
gametes, i.e., AB, Ab, aB, and ab, where AB and ab are called parental (or
non-recombinant) types, and Ab and aB are called non-parental (or recombinant)
types. Based on the genetic principle of crossing-over and recombination, the two parental
types together have theoretical frequency 1 - r, and the two recombinant types together have
theoretical frequency r. In the F1 hybrid, allele frequencies are all equal to 0.5. As
gametes, AB and ab have equal frequency, and so do Ab and aB.
Therefore, the four gametes AB, Ab, aB, and ab have theoretical frequencies
(1/2)(1 - r), (1/2)r, (1/2)r, and (1/2)(1 - r), respectively, which are also the theoretical
frequencies of homozygous genotypes AABB, AAbb, aaBB, and aabb in the F1-derived DH
population. In table 2.11, n1 and n4 are the numbers of DH lines having the two
parental genotypes; n2 and n3 are the numbers of DH lines having the two
recombinant genotypes; the total population size is n = n1 + n2 + n3 + n4.


TAB. 2.10 – Frequencies of four identifiable genotypes in temporary populations when allele A is recessive to allele a, and allele B is recessive to allele b.

Columns: AABB, AA_b (or AABb + AAbb), _aBB (or AaBB + aaBB), _a_b (or AaBb + Aabb + aaBb + aabb).
Rows: P1BC1F1, P2BC1F1, F2, F3, P1BC2F1, P2BC2F1, P1BC1F2, P2BC1F2, P1BC2F2, P2BC2F2.

Notes: a blank indicates that the corresponding genotype is absent from the population, and its theoretical frequency is 0; r is the one-meiosis recombination frequency.
Take the barley DH population given in figure 1.7 as an example. For markers
Act8A and OP06, AABB is the genotype of parent Harrington, coded by (2, 2); aabb
is the genotype of parent TR306, coded by (0, 0) when both markers are considered.
The two recombinants are coded by (2, 0) and (0, 2), respectively. Counting from the
genotypic data at Act8A and OP06 (i.e., the first two markers given in figure 1.7),
the four genotypic numbers are 64, 8, 7, and 61, which are given in
the last row of table 2.11. The total population size is n = 140 when DH lines with missing
genotypes are not counted. Here, line 3 has a missing genotype at Act8A, and lines
55, 85, 105, and 120 have missing genotypes at OP06. As far as the two markers are
concerned, the valid population size used in estimating the recombination frequency
between Act8A and OP06 is therefore 5 smaller than the actual size.


TAB. 2.11 – Expected frequencies and observed numbers of four homozygotes at two loci in DH population.

Genotype                 AABB                  AAbb            aaBB            aabb
Coding by number         (2, 2)                (2, 0)          (0, 2)          (0, 0)
Expected frequency       p1 = (1/2)(1 - r)     p2 = (1/2)r     p3 = (1/2)r     p4 = (1/2)(1 - r)
Observed sample size     n1                    n2              n3              n4
Markers Act8A and OP06   64                    8               7               61

Given below is the procedure for estimating the recombination frequency
in the DH population by the maximum likelihood principle.


(1) Construct the likelihood function on recombination frequency r. In table 2.11,
the four genotypic numbers n1, n2, n3, and n4 are random variables following a
multinomial distribution with expected frequencies p1, p2, p3, and p4, respectively.
The distribution function is given in equation 2.9, where
n = n1 + n2 + n3 + n4 is the total number of DH lines, and
C = [n!/(n1! n2! n3! n4!)] (1/2)^n is a constant independent of the unknown parameter r.

$$L(r) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\left[\tfrac{1}{2}(1-r)\right]^{n_1}\left[\tfrac{1}{2}r\right]^{n_2}\left[\tfrac{1}{2}r\right]^{n_3}\left[\tfrac{1}{2}(1-r)\right]^{n_4} = C\,(1-r)^{n_1+n_4}\,r^{n_2+n_3} \tag{2.9}$$

When recombination frequency r is known, equation 2.9 can be used to calculate
the probability of any specific values of the four variables, and is therefore called
the probability distribution function. However, the major purpose here is to
estimate parameter r given that the four variables have taken a set of observed
values. Given the observed values of the random variables, equation 2.9 becomes a
function of the unknown parameter r, which is normally called a likelihood function
in statistics for distinction and convenience.


(2) Construct the log-likelihood function on recombination frequency r. By name,
the maximum likelihood estimate is the value of r which maximizes the likelihood
function given in equation 2.9. However, in most cases it is difficult to find the
maximum directly from the likelihood function. Derivatives of the logarithm of
the likelihood function, or log-likelihood for short, are much easier to handle, and taking
the logarithm does not change the location of the maximum. The log-likelihood function of
equation 2.9 is given in equation 2.10.

$$\ln L(r) = \ln C + (n_1+n_4)\ln(1-r) + (n_2+n_3)\ln r \tag{2.10}$$


(3) Calculate the first-order and second-order derivatives of the log-likelihood. The first-order
and second-order derivatives of the log-likelihood in equation 2.10 are given in
equations 2.11 and 2.12, respectively.

$$[\ln L(r)]' = \frac{d\ln L}{dr} = -\frac{n_1+n_4}{1-r} + \frac{n_2+n_3}{r} \tag{2.11}$$

$$[\ln L(r)]'' = \frac{d^2\ln L}{dr^2} = -\frac{n_1+n_4}{(1-r)^2} - \frac{n_2+n_3}{r^2} \tag{2.12}$$


(4) Calculate the value of recombination frequency r which maximizes the
likelihood and log-likelihood functions, i.e., the maximum likelihood estimate (MLE). Set
the first-order derivative equal to 0 to find the MLE of r, which is given in
equation 2.13. As expected, the MLE of r is equal to the proportion of the two recombinant
genotypes in the DH population.

$$\hat{r} = \frac{n_2+n_3}{n_1+n_2+n_3+n_4} = \frac{n_2+n_3}{n} \tag{2.13}$$


(5) Calculate the variance of the MLE of recombination frequency. The variance of
the MLE is obtained from Fisher's information. Fisher's information is equal to
the negative second-order derivative of the log-likelihood function, evaluated at the MLE.
The variance of the MLE is equal to the reciprocal of Fisher's information. Fisher's
information on r in the DH population is given in equation 2.14. The variance of the MLE
of r is given in equation 2.15. The square root of equation 2.15 gives the standard error of the
estimate.

$$I = -[\ln L(r)]''\Big|_{r=\hat{r}} = \frac{n_1+n_4}{(1-\hat{r})^2} + \frac{n_2+n_3}{\hat{r}^2} = \frac{n}{\hat{r}(1-\hat{r})} \tag{2.14}$$

$$V_{\hat{r}} = \frac{1}{I} = \frac{\hat{r}(1-\hat{r})}{n} \tag{2.15}$$


(6) Likelihood ratio test on linkage relationship. The null hypothesis in the test is
H0: r = 0.5, i.e., the two loci are not linked, or are inherited independently. The
alternative hypothesis is HA: r < 0.5, i.e., the two loci are linked on one chromosome.
Let r = 0.5, i.e., when the null hypothesis is true, the value of the likelihood function
given in equation 2.9 is called the maximum likelihood of the null hypothesis. Let
r = r̂ as given in equation 2.13, i.e., when the alternative hypothesis is true, the value
of the likelihood function is called the maximum likelihood of the alternative
hypothesis. The two maximum likelihoods are given below.

$$\max L(H_0) = L(r = 0.5) = C\left(\tfrac{1}{2}\right)^{n}$$

$$\max L(H_A) = L(r = \hat{r}) = C\,(1-\hat{r})^{n_1+n_4}\,\hat{r}^{\,n_2+n_3}$$

From the two maximum likelihoods, a test statistic can be defined by equation
2.16, which is called the likelihood ratio test (LRT) in statistics. The LRT defined in
equation 2.16 approaches a Chi-square distribution asymptotically. Under H0, r is
assumed to be 0.5; no parameter is to be estimated, i.e., the number of parameters to
be estimated is equal to 0. Under HA, r is the only parameter to be estimated, i.e., the
number of parameters to be estimated is equal to 1. It has been proved that
the degree of freedom of the Chi-square distribution is equal to the difference
between the two numbers. Therefore, the LRT defined in equation 2.16
asymptotically approaches a Chi-square distribution with one degree of freedom, and the
linkage relationship can be tested accordingly.

$$\mathrm{LRT} = -2\ln\frac{\max L(H_0)}{\max L(H_A)} = -2\ln\frac{\left(\tfrac{1}{2}\right)^{n}}{(1-\hat{r})^{n_1+n_4}\,\hat{r}^{\,n_2+n_3}} = 2(n_1+n_4)\ln[2(1-\hat{r})] + 2(n_2+n_3)\ln(2\hat{r}) \sim \chi^2(1) \tag{2.16}$$

From the four genotypic numbers of markers Act8A and OP06 in the barley
DH population (table 2.11), the MLE of recombination frequency is
r̂ = 0.1071 (equation 2.13), the standard error of the estimate is SE(r̂) = 0.0261 (square
root of equation 2.15), and the likelihood ratio test statistic is LRT = 98.74 (equation 2.16)
with P = 2.88 × 10^-23, indicating that the two markers are closely linked on one
chromosome.
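
The six steps above can be traced numerically with a short Python sketch. It applies equations 2.13 to 2.16 to the four counts in table 2.11; the function name is hypothetical, and the P value uses the identity P(χ²(1) > x) = erfc(√(x/2)) for one degree of freedom instead of a statistics library. The sketch assumes that at least one recombinant and one parental line were observed, so that 0 < r̂ < 1.

```python
import math


def dh_linkage_test(n1, n2, n3, n4):
    """MLE, standard error and likelihood ratio test for two loci in a DH population.

    n1, n4: numbers of the two parental genotypes (AABB, aabb)
    n2, n3: numbers of the two recombinant genotypes (AAbb, aaBB)
    """
    n = n1 + n2 + n3 + n4
    r_hat = (n2 + n3) / n                      # equation 2.13
    info = n / (r_hat * (1.0 - r_hat))         # equation 2.14
    se = math.sqrt(1.0 / info)                 # square root of equation 2.15
    lrt = (2.0 * (n1 + n4) * math.log(2.0 * (1.0 - r_hat))
           + 2.0 * (n2 + n3) * math.log(2.0 * r_hat))   # equation 2.16
    p_value = math.erfc(math.sqrt(lrt / 2.0))  # upper tail of Chi-square, 1 df
    return {"r_hat": r_hat, "se": se, "lrt": lrt, "p_value": p_value}


# Counts for markers Act8A and OP06 in the barley DH population (table 2.11)
result = dh_linkage_test(64, 8, 7, 61)
for key, value in result.items():
    print(key, value)
```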


2.3.2 General Procedure on the Maximum Likelihood 
Estimation of Recombination Frequency 


The previous section focused on the DH population. Methods introduced in this
section are more general and applicable to all populations developed from controlled
crosses. Assume k identifiable genotypes are included in the population
with theoretical frequencies p_i (i = 1, 2, ..., k). Based on genotyping data of n
individuals or lines in the population, the ith identifiable genotype has the observed
number n_i (i = 1, 2, ..., k).

(1) Construct the likelihood function on recombination frequency r. As random
variables, genotypic numbers n_i (i = 1, 2, ..., k) follow a multinomial distribution
with expected frequencies p_i (i = 1, 2, ..., k). Expected frequencies in bi-parental
populations have been given in §2.1 and §2.2. Populations derived from two
heterozygous parents or four to eight homozygous parents will be introduced in
chapters 7 and 8. Given the observed genotypic numbers, the likelihood function is
defined by equation 2.17.

$$L(r) = \frac{n!}{n_1!\,n_2!\cdots n_k!}\,(p_1)^{n_1}(p_2)^{n_2}\cdots(p_k)^{n_k} \tag{2.17}$$


(2) Construct the log-likelihood function on recombination frequency r. In most
cases, the derivatives of a likelihood function such as equation 2.17 are not easily
tractable, and it is easier to use the log-likelihood to calculate the
MLE. The log-likelihood function of equation 2.17 is given in equation 2.18,
where C = n!/(n1! n2! ··· nk!) is a constant independent of recombination frequency.

$$\ln L(r) = \ln C + n_1\ln p_1 + n_2\ln p_2 + \cdots + n_k\ln p_k \tag{2.18}$$


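(3) Calculate the first-order and second-order derivatives of the log-likelihood. The
first-order and second-order derivatives of the log-likelihood are given in
equations 2.19 and 2.20, respectively.

$$[\ln L(r)]' = \frac{d\ln L}{dr} = \sum_{i=1}^{k} n_i\frac{d\ln p_i}{dr} = \sum_{i=1}^{k}\frac{n_i}{p_i}\left(\frac{dp_i}{dr}\right) \tag{2.19}$$

$$[\ln L(r)]'' = \frac{d^2\ln L}{dr^2} = -\sum_{i=1}^{k}\frac{n_i}{p_i^2}\left(\frac{dp_i}{dr}\right)^2 + \sum_{i=1}^{k}\frac{n_i}{p_i}\left(\frac{d^2p_i}{dr^2}\right) \tag{2.20}$$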


(4) Calculate the MLE of recombination frequency r, which maximizes the likelihood
and log-likelihood functions. Only for a few populations, such as DH, RIL, and
BC1F1, can the MLE of r be given explicitly by solving the likelihood equation, i.e., by setting
the first-order derivative equal to 0. For most populations, an explicit solution
like equation 2.13 cannot be obtained, and iterative algorithms have to be used.

Fortunately, both the first-order and second-order derivatives of the log-likelihood
can be written out explicitly for all populations and categories of markers given in
tables 2.4 to 2.10, and therefore the Newton iterative algorithm (also called the
Newton-Raphson algorithm) can be applied. Given an initial value of the recombination
frequency, i.e., r(0), a new value, i.e., r(1), is calculated by equation 2.21 and then
used as the new initial value. The iteration continues until the absolute difference between
two successive values is smaller than a predefined threshold. The value obtained at
the final iterative step is the MLE of recombination frequency. The threshold
used in the Newton-Raphson algorithm is a small positive number (a precision of 0.0001 is
used in the example at the end of this section).

$$r^{(1)} = r^{(0)} - \frac{[\ln L(r^{(0)})]'}{[\ln L(r^{(0)})]''} \tag{2.21}$$


(5) Calculate the variance of the MLE of recombination frequency. As given in
equation 2.22, Fisher's information is defined by the negative second-order derivative of the
log-likelihood; the variance of the MLE is equal to the reciprocal of Fisher's information.
If wanted, the standard error of the estimate can be calculated as the square root of
the variance given in equation 2.22.

$$I = -\frac{d^2\ln L}{dr^2}, \qquad V_{\hat{r}} = \frac{1}{I} \tag{2.22}$$


(6) Likelihood ratio test on linkage relationship. The null hypothesis in the test is H0:
r = 0.5, i.e., the two loci are not linked, or are inherited independently. The
alternative hypothesis is HA: r < 0.5, i.e., the two loci are linked on one chromosome.
Based on the maximum likelihoods under the two hypotheses, the likelihood ratio test
(LRT) defined in equation 2.23 approaches a Chi-square distribution with one
degree of freedom and therefore can be used to test the significance of the linkage
relationship.

$$\mathrm{LRT} = -2\ln\frac{\max L(H_0)}{\max L(H_A)} = -2\ln\frac{L(r=0.5)}{L(r=\hat{r})} = -2\left[\ln L(r=0.5) - \ln L(r=\hat{r})\right] \sim \chi^2(1) \tag{2.23}$$


The LRT given in equation 2.23 uses the logarithm in the natural base
e ≈ 2.7183. In genetic studies, the logarithm in the common base 10 is more frequently
applied to the ratio of the two likelihoods, i.e., equation 2.24, which is called the
logarithm of odds (LOD). Obviously, LRT is 4.6052 (i.e., 2 ln 10) times LOD, or LOD is
about 0.2171 times LRT (equation 2.24). However, as a statistic, the LOD score
does not approach a Chi-square distribution. To find the significance probability (P value)
of a linkage relationship, the value of LRT is still needed.

$$\mathrm{LOD} = \log_{10}\frac{\max L(H_A)}{\max L(H_0)} = \frac{\mathrm{LRT}}{2\ln 10} \approx 0.2171\,\mathrm{LRT} \tag{2.24}$$
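
The relationship between LRT and LOD in equation 2.24 is a change of logarithm base, which the following minimal Python sketch makes explicit. The function names are illustrative; the log-likelihood values in the usage lines are the rounded values for the barley DH example (with the constant ln C omitted, since it cancels in the ratio).

```python
import math


def lod_from_lrt(lrt):
    """Convert an LRT statistic to a LOD score: LOD = LRT / (2 ln 10)."""
    return lrt / (2.0 * math.log(10.0))


def lrt_from_lod(lod):
    """Convert a LOD score back to the LRT scale: LRT = 2 ln 10 * LOD."""
    return 2.0 * math.log(10.0) * lod


def lod_from_log_likelihoods(lnL_alt, lnL_null):
    """LOD computed directly from natural-log likelihoods under HA and H0."""
    return (lnL_alt - lnL_null) / math.log(10.0)


# Barley DH example: ln L at the MLE and at r = 0.5 (rounded, constant omitted)
print(lod_from_log_likelihoods(-47.67, -97.04))
print(lrt_from_lod(3.0))  # LRT value corresponding to a LOD threshold of 3
```

Since the LOD score itself is not Chi-square distributed, a P value would be obtained by converting a LOD back to the LRT scale first.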


As an alternative to the Newton algorithm, the EM algorithm does not need the
derivatives of the log-likelihood (Dempster et al., 1977). The EM algorithm can be used
to calculate the MLE of r in some populations such as F2 (see §2.3.5 for details) but is
difficult to implement in other populations, such as F3, BC1F2, BC2F1, and
BC2F2. Therefore, Newton's algorithm is a much more general method in
linkage analysis for all populations and marker categories, provided that the first-order and
second-order derivatives of the log-likelihood can be derived explicitly. Another
advantage of the Newton algorithm is that, in addition to the MLE of r, the variance and
standard error of the MLE can be calculated at the end of the iterations.
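
To illustrate how equations 2.19 to 2.22 fit together for a population without an explicit solution, the sketch below applies the Newton algorithm to an F2 population with two co-dominant markers. It assumes the standard F2 frequencies of the nine identifiable genotypes; the function names, the starting value, the tolerance, and the genotype counts (chosen as the expected numbers for n = 1000 and r = 0.2) are all hypothetical. The clamping of r to the interval (0, 0.5] is a safeguard added here, not part of the procedure described above.

```python
def f2_terms(r):
    """Frequencies of the nine F2 genotypes and their first and second derivatives.

    Order: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb.
    """
    s = 1.0 - r
    p = [0.25 * s * s, 0.5 * r * s, 0.25 * r * r,
         0.5 * r * s, 0.5 * (s * s + r * r), 0.5 * r * s,
         0.25 * r * r, 0.5 * r * s, 0.25 * s * s]
    dp = [-0.5 * s, 0.5 - r, 0.5 * r,
          0.5 - r, 2.0 * r - 1.0, 0.5 - r,
          0.5 * r, 0.5 - r, -0.5 * s]
    d2p = [0.5, -1.0, 0.5, -1.0, 2.0, -1.0, 0.5, -1.0, 0.5]
    return p, dp, d2p


def derivatives(counts, terms, r):
    """First and second derivatives of the log-likelihood (equations 2.19, 2.20)."""
    p, dp, d2p = terms(r)
    d1 = sum(n * d / q for n, q, d in zip(counts, p, dp))
    d2 = sum(-n * (d / q) ** 2 + n * dd / q
             for n, q, d, dd in zip(counts, p, dp, d2p))
    return d1, d2


def newton_mle(counts, terms, r0=0.25, tol=1e-8, max_iter=100):
    """Newton iteration (equation 2.21); returns the MLE and its standard error."""
    r = r0
    for _ in range(max_iter):
        d1, d2 = derivatives(counts, terms, r)
        r_new = min(max(r - d1 / d2, 1e-6), 0.5)   # keep r inside (0, 0.5]
        converged = abs(r_new - r) < tol
        r = r_new
        if converged:
            break
    _, d2 = derivatives(counts, terms, r)
    variance = 1.0 / (-d2)                         # equation 2.22
    return r, variance ** 0.5


counts = [160, 80, 10, 80, 340, 80, 10, 80, 160]   # hypothetical F2 counts
r_hat, se = newton_mle(counts, f2_terms)
print(r_hat, se)
```

Any other population can be handled by supplying a different `terms` function that returns its frequencies and derivatives.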
As an example, the curve of the log-likelihood function constructed from the four genotypic numbers and expected frequencies in table 2.11 is shown in figure 2.1, where the constant ln C was not included. It can be observed that the maximum of the log-likelihood occurs between 0.08 and 0.14. At the maximum point, the x-value is in fact the recombination frequency to be estimated, and the y-value is the maximum log-likelihood.


[Figure 2.1: log-likelihood (y-axis) plotted against recombination frequency (x-axis, from 0.02 to 0.5).]

FIG. 2.1 – Logarithm likelihood (log-likelihood) between two markers in the barley DH population (constant term ln C not included).


Figure 2.2A and B show curves of the first-order and second-order derivatives of the log-likelihood function, respectively. The first-order derivative is positive for smaller x-values but decreases as the x-value increases. Its intersection with the x-axis occurs between 0.08 and 0.14. The second-order derivative is always negative. Based on calculus, the log-likelihood function has one and only one maximum in the interval (0, 0.5), and the maximum is achieved at the intersection of the first-order derivative with the x-axis.
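The shapes of the three curves can be reproduced with a few lines of code. The sketch below is an illustration only, assuming the DH log-likelihood ln L(r) = (n1 + n4) ln(1 - r) + (n2 + n3) ln(r) without the constant term, the counts 64, 8, 7, and 61 of table 2.11, and a grid that simply mimics the axis ticks of figure 2.1. The function names are hypothetical.

```python
import math

N_PARENTAL = 64 + 61    # AABB + aabb, table 2.11
N_RECOMBINANT = 8 + 7   # AAbb + aaBB, table 2.11


def log_lik(r):
    """Log-likelihood without the constant term."""
    return N_PARENTAL * math.log(1 - r) + N_RECOMBINANT * math.log(r)


def d1(r):
    """First-order derivative of the log-likelihood."""
    return -N_PARENTAL / (1 - r) + N_RECOMBINANT / r


def d2(r):
    """Second-order derivative of the log-likelihood (always negative)."""
    return -N_PARENTAL / (1 - r) ** 2 - N_RECOMBINANT / r ** 2


# Grid 0.02, 0.08, ..., 0.50, chosen here to match the ticks of figure 2.1.
grid = [round(0.02 + 0.06 * k, 2) for k in range(9)]
for r in grid:
    print(f"{r:.2f}  lnL = {log_lik(r):8.2f}  d1 = {d1(r):9.2f}  d2 = {d2(r):12.2f}")
```

On this grid the first-order derivative changes sign between 0.08 and 0.14, and the second-order derivative stays negative throughout.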
FIG. 2.2 – The first-order (A) and second-order (B) derivatives of log-likelihood between two markers in the barley DH population.


Setting an initial value of 0.01 and a precision of 0.0001, the Newton algorithm stops after 8 iterations and acquires an estimate of 0.1071 (table 2.12), the same as the direct estimation by equation 2.13.

TAB. 2.12 – Newton algorithm in estimating the recombination frequency in DH population.

Iteration   Estimated r   ln L(r)   [ln L(r)]'   [ln L(r)]''
1           0.0100        -70.33    1373.74      -1.50 × 10^5
2           0.0192        -61.75    655.83       -4.10 × 10^4
3           0.0351        -54.70    297.38       -1.23 × 10^4
4           0.0593        -50.01    119.90       -4401.17
5           0.0866        -48.02    36.40        -2150.79
6           0.1035        -47.68    5.49         -1555.66
7           0.1070        -47.67    0.16         -1466.11
8           0.1071        -47.67    0.00         -1463.47

The second derivative is equal to -1463.47 at the maximum. Based on Fisher's information, the variance and standard error of the estimate are given as follows.

$$I = -[\ln L(\hat{r})]'' = 1463.47, \quad V_{\hat{r}} = \frac{1}{I} = 6.83 \times 10^{-4}, \quad SE_{\hat{r}} = \sqrt{V_{\hat{r}}} = 0.0261$$
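The iteration in table 2.12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the table: the DH log-likelihood is taken without its constant term, the stopping rule is assumed to be "the next Newton step is smaller than the precision", and the function name `newton_dh` is hypothetical.

```python
import math


def newton_dh(n_parental, n_recombinant, r0=0.01, tol=1e-4, max_iter=50):
    """Newton iteration for the MLE of r in a DH population.

    Returns (r_hat, variance, standard_error, history), where history holds
    one tuple (r, lnL, first derivative, second derivative) per iteration.
    """
    history = []
    r = r0
    for _ in range(max_iter):
        ln_l = n_parental * math.log(1 - r) + n_recombinant * math.log(r)
        d1 = -n_parental / (1 - r) + n_recombinant / r
        d2 = -n_parental / (1 - r) ** 2 - n_recombinant / r ** 2
        history.append((r, ln_l, d1, d2))
        step = -d1 / d2           # Newton update: r_new = r - d1 / d2
        if abs(step) < tol:       # assumed stopping rule
            break
        r += step
    information = -d2             # Fisher's information at the estimate
    variance = 1.0 / information
    return r, variance, math.sqrt(variance), history


r_hat, variance, se, history = newton_dh(64 + 61, 8 + 7)
for k, (r, ln_l, d1, d2) in enumerate(history, start=1):
    print(f"{k}  {r:.4f}  {ln_l:8.2f}  {d1:9.2f}  {d2:12.2f}")
print(f"r = {r_hat:.4f}, V = {variance:.3g}, SE = {se:.4f}")
```

The variance and standard error come at no extra cost, because the second derivative is already available at the last iteration.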


2.3.3 Estimation of Recombination Frequency Between
One Co-Dominant and One Dominant Marker
in F2 Population


As a more complex example, recombination frequency between one co-dominant locus and one dominant locus is considered in an F2 population. The expected frequencies of the six identifiable genotypes are represented by p_i (i = 1, 2, ..., 6), which have been given in table 2.6. The observed genotypic numbers are represented by n_i (i = 1, 2, ..., 6), and the total number of F2 individuals is n. The likelihood function is given in equation 2.25, where the constant $C = \frac{n!}{n_1!\,n_2!\cdots n_6!}$ is independent of recombination frequency.

$$L(r) = C\prod_{i=1}^{6} p_i^{n_i} = C\left[\tfrac{1}{4}(1-r^2)\right]^{n_1}\left[\tfrac{1}{4}r^2\right]^{n_2}\left[\tfrac{1}{2}(1-r+r^2)\right]^{n_3}\left[\tfrac{1}{2}r(1-r)\right]^{n_4}\left[\tfrac{1}{4}r(2-r)\right]^{n_5}\left[\tfrac{1}{4}(1-r)^2\right]^{n_6} \tag{2.25}$$


Applying the logarithm transformation to both sides of equation 2.25 gives the log-likelihood function in equation 2.26, where the numerical factors 1/4 and 1/2, which do not depend on r, are absorbed into the constant term ln C.

$$\ln L(r) = \ln C + (2n_2+n_4+n_5)\ln r + n_1\ln(1+r) + (n_1+n_4+2n_6)\ln(1-r) + n_5\ln(2-r) + n_3\ln(1-r+r^2) \tag{2.26}$$


The first-order and second-order derivatives of the log-likelihood (equation 2.26) are derived and given in equations 2.27 and 2.28, respectively.

$$[\ln L(r)]' = \frac{d\ln L(r)}{dr} = \frac{2n_2+n_4+n_5}{r} + \frac{n_1}{1+r} - \frac{n_1+n_4+2n_6}{1-r} - \frac{n_5}{2-r} - \frac{n_3(1-2r)}{1-r+r^2} \tag{2.27}$$

$$[\ln L(r)]'' = \frac{d^2\ln L(r)}{dr^2} = -\frac{2n_2+n_4+n_5}{r^2} - \frac{n_1}{(1+r)^2} - \frac{n_1+n_4+2n_6}{(1-r)^2} - \frac{n_5}{(2-r)^2} + \frac{n_3(1+2r-2r^2)}{(1-r+r^2)^2} \tag{2.28}$$


Setting the first-order derivative (equation 2.27) equal to 0 results in a high-order polynomial equation in the recombination frequency. It is very difficult to find the exact solution explicitly, and an iterative algorithm has to be used. Take an F2 population in wheat as an example. The cross was made between one disease-resistant parent and one susceptible parent, with the purpose of identifying the molecular markers linked with the resistance gene. As shown in table 2.13, for resistant F2 individuals, the observed numbers of the three marker types are 572, 1161, and 14, respectively; for susceptible F2 individuals, the observed numbers are 3, 22, and 569, respectively. Goodness-of-fit tests given in table 1.3, chapter 1 indicated that resistance is a qualitative trait controlled by one dominant gene, the molecular marker is co-dominant, and the resistance gene and molecular marker are highly associated.


TAB. 2.13 – Sample sizes of six distinguishable types at one co-dominant marker locus and one dominant disease-resistance locus, observed in one wheat F2 population.

Marker genotype                    Resistant (genotype B_)   Susceptible (genotype bb)
Type of resistant parent (AA)      n1 = 572                  n2 = 3
Type of F1 hybrid (Aa)             n3 = 1161                 n4 = 22
Type of susceptible parent (aa)    n5 = 14                   n6 = 569

Notes: resistant is dominant to susceptible in phenotype; two alleles at the marker locus are represented by A and a; two alleles at the disease locus are represented by B and b.


Setting an initial value at 0.001 and precision at 0.0001, the Newton algorithm stops after 9 iterations and acquires an estimate of 0.0179 for the recombination frequency between the resistance gene and the molecular marker (table 2.14). At the estimate, the log-likelihood reaches its maximum at -201.11, and the first-order derivative approaches 0. The second-order derivative at the estimate, i.e., -1.29 × 10^5, can be used to find the variance and standard error of the estimate.

$$\hat{r} = 0.0179, \quad I = 1.29 \times 10^{5}, \quad V_{\hat{r}} = \frac{1}{I} = 7.75 \times 10^{-6}, \quad SE_{\hat{r}} = \sqrt{V_{\hat{r}}} = 0.0028$$


2.3.4 Initial Values in Newton Algorithm 


Like most iterative algorithms, the convergence speed of the Newton algorithm depends on the initial value. The closer the initial value is to the estimated value, the faster the convergence. When the initial value is too far away from the estimated value, the Newton algorithm may not converge to the maximum. As far as recombination frequency is concerned, likelihood functions differ between populations, and sometimes also depend on marker categories, e.g., equation 2.9 for the DH population, and equation 2.25 for one co-dominant locus and one dominant locus in the F2 population. However, the valid region of recombination frequency is always 0 to 0.5, the log-likelihood has a curve similar to that shown in figure 2.1, and the two derivatives have curves similar to those shown in figure 2.2A and B. There is one and only one maximum on the likelihood function, regardless of the type of population and category of markers.

TAB. 2.14 – Newton algorithm in estimating the recombination frequency between one co-dominant marker and one dominant disease-resistant gene in a wheat F2 population.

Iteration   Estimated r   ln L(r)    [ln L(r)]'   [ln L(r)]''
1           0.0010        -282.75    39670.85     -4.20 × 10^7
2           0.0019        -257.02    19268.08     -1.11 × 10^7
3           0.0037        -234.28    9081.59      -3.10 × 10^6
4           0.0066        -216.51    4018.63      -9.5 × 10^5
5           0.0108        -205.69    1548.44      -3.58 × 10^5
6           0.0151        -201.67    430.81       -1.81 × 10^5
7           0.0175        -201.12    50.88        -1.35 × 10^5
8           0.0179        -201.11    -0.26        -1.29 × 10^5
9           0.0179        -201.11    0.0071       -1.29 × 10^5

Figure 2.3 is a schematic representation of the Newton algorithm for calculating the MLE of recombination frequency. The objective is to search along the curve of the first derivative for its intersection with the x-axis. For an initial value $r^{(0)}$, at the point with $x = r^{(0)}$ and $y = [\ln L(r^{(0)})]'$, draw a tangent line to the curve of $[\ln L(r)]'$. The intersection of the tangent line with the x-axis is used as the new initial value in the next iteration, or as the maximum point when two successive values are close enough. From figure 2.3, it can be seen intuitively that Newton's algorithm converges to $\hat{r}$ within a few iterations when $r^{(0)}$ is close to or slightly smaller than $\hat{r}$. When the estimate $\hat{r}$ is small, the algorithm may never converge to $\hat{r}$ when the initial value is greater than 0.2. If this is the case, the initial value should be reduced, by half for example, and the iteration restarted. Previous studies (Sun et al., 2012) indicated that smaller initial values, such as $r^{(0)} = 0.01$ or 0.001, can effectively avoid the non-convergence of the algorithm. Of course, for a larger estimated recombination frequency, smaller initial values will take more steps to reach the solution.
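The effect of a poor initial value, and the remedy of halving it, can be illustrated with a short sketch. The code assumes the DH log-likelihood and the counts of table 2.11 (125 parental and 15 recombinant lines). The function `newton_with_restart` is a hypothetical helper written for this illustration, and the rule "restart from half the initial value whenever an update leaves the interval (0, 0.5]" is only one possible way to implement the halving suggested above.

```python
def d1(r, n_par, n_rec):
    """First-order derivative of the DH log-likelihood."""
    return -n_par / (1 - r) + n_rec / r


def d2(r, n_par, n_rec):
    """Second-order derivative of the DH log-likelihood."""
    return -n_par / (1 - r) ** 2 - n_rec / r ** 2


def newton_with_restart(n_par, n_rec, r0, tol=1e-4, max_iter=100):
    """Newton iteration that halves the initial value after an invalid update.

    Returns (estimate, initial value that finally led to convergence).
    """
    start = r0
    r = start
    for _ in range(max_iter):
        step = -d1(r, n_par, n_rec) / d2(r, n_par, n_rec)
        r_new = r + step
        if not (0.0 < r_new <= 0.5):
            start /= 2.0          # update left the valid region: halve and restart
            r = start
            continue
        if abs(step) < tol:
            return r_new, start
        r = r_new
    raise RuntimeError("Newton iteration did not converge")


# One plain Newton step from 0.3 already leaves the valid region.
first_update = 0.3 - d1(0.3, 125, 15) / d2(0.3, 125, 15)
print(f"first update from 0.3: {first_update:.4f}")

r_hat, used_start = newton_with_restart(125, 15, r0=0.3)
print(f"converged to {r_hat:.4f} after restarting from {used_start}")
```

In this example a single halving, from 0.3 to 0.15, is enough to bring the iteration back into the region where it converges.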


[Figure 2.3: curve of the first-order derivative of the log-likelihood function, [ln L(r)]', plotted against r, with the tangent line drawn at the initial value r^(0).]

FIG. 2.3 – Schematic representation of the Newton algorithm in calculating the maximum likelihood estimate of recombination frequency.


Sometimes, initial values can be properly chosen from the observed genotypic numbers. Taking the F2 population as an example, the 9 observed genotypic numbers are n_i (i = 1, 2, ..., 9), with the same genotypic order as given in table 2.5. When double heterozygotes are not considered, the total sample size is n - n5, which contains 2(n - n5) haploids. When one locus is heterozygous, i.e., AABb, AaBB, Aabb, and aaBb, each individual carries one recombinant haploid and one parental haploid. In homozygotes AAbb and aaBB, both haploids are recombinant types. In homozygotes AABB and aabb, both haploids are parental types. Therefore, the initial value given in equation 2.29, i.e., the proportion of recombinant haploids in the selected genotypes, would not be too far away from the estimated recombination frequency and therefore improves the efficiency of the algorithm.


$$r^{(0)} = \frac{n_2 + 2n_3 + n_4 + n_6 + 2n_7 + n_8}{2(n - n_5)} \tag{2.29}$$
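Equation 2.29 amounts to counting recombinant haploids in the eight genotypes other than AaBb. A minimal sketch is given below, assuming the genotypic order AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb and the observed numbers listed later in table 2.15. The count n2 = 0 for AABb is inferred here from the total sample size of 107 and should be treated as an assumption. The function name is hypothetical.

```python
def initial_r_f2(counts):
    """Initial value of r from nine F2 genotypic numbers (equation 2.29)."""
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = counts
    n = sum(counts)
    # One recombinant haploid in each single heterozygote, two in AAbb and aaBB.
    recombinant = n2 + 2 * n3 + n4 + n6 + 2 * n7 + n8
    return recombinant / (2 * (n - n5))


counts = (23, 0, 0, 2, 50, 3, 0, 3, 26)   # assumed from table 2.15
print(round(initial_r_f2(counts), 4))     # 0.0702
```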
2.3.5 EM Algorithm in Estimating Recombination
Frequency in F2 Populations


As briefly mentioned in §2.3.2, the expectation and maximization (EM) algorithm does not need the derivation of derivatives. Sometimes, the EM algorithm can be used in recombination frequency estimation as well. As an effective and widely adopted iterative algorithm for analyzing incomplete data, the EM algorithm is also frequently used in gene mapping, as can be seen in chapters 4-8. Put simply, given initial values of the parameters to be estimated, the EM algorithm converts the incomplete data into complete data and then applies the complete data analysis methods, which are always much simpler and easier, to acquire new initial values for the next step of the iteration.

Take two co-dominant markers and the F2 population as an example. From genotypic data of the F2 population as shown in figure 1.8 in chapter 1, the observed numbers of nine genotypes at M01 and M02 are given in table 2.15. From exercise 1.4 in chapter 1, the two markers are known to have a significant linkage relationship. Expected frequencies in column 3 come from table 2.5, and the overall population size is n. Given an initial value of recombination frequency, the EM algorithm calculates the expected numbers of 10 genotypes, i.e., separation of the incomplete genotype AaBb into two complete genotypes AB/ab and Ab/aB, where the allelic linkage relationship is clearly known. Based on the numbers of the 10 genotypes, the proportion of recombinant haploids can be calculated and then used as the new initial value for the next step. So the main task here is to determine the numbers of complete genotypes AB/ab and Ab/aB from the n5 individuals with genotype AaBb.

In homozygotes AABB and aabb, both haploids are parental types, or the proportion of recombinant haploids is 0. In homozygotes AAbb and aaBB, both haploids are recombinant types, or the proportion of recombinant haploids is 1. When one locus is heterozygous, i.e., AABb, AaBB, Aabb, and aaBb, each individual carries one recombinant haploid and one parental haploid, or the proportion of recombinant haploids is 0.5. The double heterozygote AaBb is less simple and represents the incomplete data in this example. In the F2 population, genotype AaBb consists of two linkage phases, i.e., AB/ab and Ab/aB, with expected frequencies $\frac{1}{2}(1-r)^2$ and $\frac{1}{2}r^2$, respectively (see table 2.3). Phase AB/ab contains two parental haploids, or the proportion of recombinant haploids is 0; phase Ab/aB contains two recombinant haploids, or the proportion of recombinant haploids is 1. Individuals with a known linkage phase are called complete data, as parental and recombinant haploids can be clearly determined from them. Given an initial value, which is treated as the true recombination frequency, the proportion of the repulsion linkage phase Ab/aB is given in equation 2.30, which is also equal to the proportion of recombinant haploids in the double heterozygote AaBb.

$$\Pr\{R|AaBb\} = \frac{\frac{1}{2}r^2}{\frac{1}{2}(1-r)^2 + \frac{1}{2}r^2} = \frac{r^2}{1 - 2r + 2r^2} \tag{2.30}$$


Based on previous discussions, the proportion of recombinant haploids in each identifiable genotype can be calculated and is given in column 4 in table 2.15.
TAB. 2.15 – EM algorithm in estimating recombination frequency between two co-dominant markers in F2 population.

Genotype   Observed sample size   Expected frequency     Frequency of recombinant haploid        Expected sample size
AABB       n1 = 23                (1/4)(1 - r)^2         Pr{R|G1} = 0                            24.75
AABb       n2 = 0                 (1/2)r(1 - r)          Pr{R|G2} = 1/2                          1.96
AAbb       n3 = 0                 (1/4)r^2               Pr{R|G3} = 1                            0.04
AaBB       n4 = 2                 (1/2)r(1 - r)          Pr{R|G4} = 1/2                          1.96
AaBb       n5 = 50                (1/2)(1 - 2r + 2r^2)   Pr{R|G5} = r^2/(1 - 2r + 2r^2)          49.58
Aabb       n6 = 3                 (1/2)r(1 - r)          Pr{R|G6} = 1/2                          1.96
aaBB       n7 = 0                 (1/4)r^2               Pr{R|G7} = 1                            0.04
aaBb       n8 = 3                 (1/2)r(1 - r)          Pr{R|G8} = 1/2                          1.96
aabb       n9 = 26                (1/4)(1 - r)^2         Pr{R|G9} = 0                            24.75


The EM algorithm as applied in estimating recombination frequency between two co-dominant markers in F2 is summarized below.

Expectation step (E-step): Given an initial value of recombination frequency, calculate the expected proportion of recombinant haploids in each identifiable genotype, i.e., column 4 in table 2.15, represented by Pr{R|G_i}. The initial value can take r_0 = 0.25, for example. Obviously, the expected sample size of recombinants in each identifiable genotype can be calculated by multiplying column 2 and column 4, explaining why this step is called expectation.

Maximization step (M-step): Based on the expected sample sizes of recombinants from the E-step, the recombination frequency is updated by

$$r_1 = \frac{1}{n}\sum_{i=1,2,\ldots,9} n_i \Pr\{R|G_i\} = \frac{n_2 + 2n_3 + n_4 + 2\hat{n}_{52} + n_6 + 2n_7 + n_8}{2n} \tag{2.31}$$

where $\hat{n}_{52} = \Pr\{R|AaBb\} \times n_5 = \frac{r_0^2}{1 - 2r_0 + 2r_0^2}\,n_5$ is the expected sample size of the repulsion linkage phase Ab/aB, given the recombination frequency r_0. In fact, equation 2.31 is exactly the MLE of recombination frequency given the 10 genotypic numbers in the F2 population, explaining why this step is called maximization. The proof is left to readers (see exercise 2.9). Take the updated value as a new initial value and repeat from the E-step until a predefined precision is reached, such as $|r_1 - r_0| < 10^{-4}$.

Using the observed genotypic numbers given in table 2.15, the estimated recombination frequencies during the EM iterative procedure are given in table 2.16, starting from three initial values 0.01, 0.25, and 0.5. An estimate of 0.0381 was acquired within 5 steps regardless of the difference in initial values, indicating that the EM algorithm can converge to the maximum rather quickly and is less dependent on initial values. Obviously, derivatives are not needed in the EM algorithm.

Taking the final iterative value 0.0381 as the MLE of r to calculate the expected genotypic frequencies (i.e., column 3 in table 2.15), the expected sample sizes of genotypes can be acquired and are shown in the last column of table 2.15. Recombinant homozygotes AAbb and aaBB have an expected number of 0.04, and therefore it is not strange that neither genotype was observed in the F2 population. Double heterozygote AaBb has an expected number of 49.58, while 50 individuals were observed. By using equation 2.30, the expected number of coupling phase AB/ab can be found to be 49.42. That is to say, only 0.16 out of the 107 F2 individuals, or out of the 49.58 double heterozygous individuals, are expected to belong to the repulsion phase. Therefore, the 50 individuals observed to be double heterozygous most likely come from the combination of two parental haploids. The extremely low number of repulsion linkage phases in the F2 population also indicates that the initial value given by equation 2.29 could be very close to the estimated value. Using the genotypic numbers given in table 2.15, the initial value from equation 2.29 is equal to 0.0702, close to the estimate of 0.0381 from the EM algorithm (table 2.16). If this initial value is used, both the Newton and EM algorithms can converge in a couple of steps.
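The last column of table 2.15 follows directly from the expected frequencies and the estimate of r. The sketch below assumes the nine F2 frequencies shown in column 3 of table 2.15, a population size of 107, and the rounded estimate r = 0.0381. The function name `expected_f2_counts` is hypothetical.

```python
def expected_f2_counts(r, n):
    """Expected sample sizes of the nine F2 genotypes at two co-dominant loci."""
    p = 1 - r
    freqs = {
        "AABB": 0.25 * p * p,
        "AABb": 0.5 * r * p,
        "AAbb": 0.25 * r * r,
        "AaBB": 0.5 * r * p,
        "AaBb": 0.5 * (1 - 2 * r + 2 * r * r),
        "Aabb": 0.5 * r * p,
        "aaBB": 0.25 * r * r,
        "aaBb": 0.5 * r * p,
        "aabb": 0.25 * p * p,
    }
    return {genotype: n * f for genotype, f in freqs.items()}


expected = expected_f2_counts(0.0381, 107)
for genotype, size in expected.items():
    print(f"{genotype}  {size:6.2f}")
```

The nine frequencies sum to one for any r, so the expected sample sizes always add up to the population size.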


TAB. 2.16 – Recombination frequency estimated by EM algorithm under three initial values.

Initial value   Iteration
                1        2        3        4        5        6
0.01            0.0374   0.0381   0.0381   0.0381   0.0381   0.0381
0.25            0.0841   0.0413   0.0382   0.0381   0.0381   0.0381
0.5             0.2710   0.0941   0.0424   0.0383   0.0381   0.0381
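The E-step and M-step described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the counts are those of table 2.15 (with n2 = 0 inferred from the total of 107), the stopping rule is an absolute change smaller than the precision, and the function name `em_f2` is hypothetical.

```python
def em_f2(counts, r0=0.25, tol=1e-4, max_iter=100):
    """EM estimate of r between two co-dominant markers in an F2 population.

    counts follow the order AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb.
    Returns (estimate, list of updated values, one per iteration).
    """
    n1, n2, n3, n4, n5, n6, n7, n8, n9 = counts
    n = sum(counts)
    r = r0
    trace = []
    for _ in range(max_iter):
        # E-step: expected number of repulsion-phase (Ab/aB) double heterozygotes.
        p_repulsion = r * r / (1 - 2 * r + 2 * r * r)      # equation 2.30
        n52 = p_repulsion * n5
        # M-step: proportion of recombinant haploids among all 2n haploids.
        r_new = (n2 + 2 * n3 + n4 + 2 * n52 + n6 + 2 * n7 + n8) / (2 * n)
        trace.append(r_new)
        if abs(r_new - r) < tol:
            return r_new, trace
        r = r_new
    return r, trace


counts = (23, 0, 0, 2, 50, 3, 0, 3, 26)   # assumed from table 2.15
for start in (0.01, 0.25, 0.5):
    estimate, trace = em_f2(counts, r0=start)
    print(start, [round(value, 4) for value in trace])
```

No derivative appears anywhere in the loop; each update only requires the expected split of the AaBb individuals into the two linkage phases.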


The EM algorithm may also be used for other marker categories in the F2 population, where the identifiable genotypes and their theoretical frequencies have been given in tables 2.6-2.10 (see exercise 2.5 for one co-dominant locus and one dominant locus). But for populations where two or more generations of meiosis occur, such as F3, BC1F2, BC2F1, and BC2F2, the proportion of recombinant haploids in each genotype cannot be easily identified in the E-step, and the EM algorithm cannot be implemented properly. In addition, derivatives are still needed if the variance of the estimate is to be estimated.


2.3.6 Effects on the Estimation of Recombination
Frequency from Segregation Distortion


When the observed genotypic frequencies cannot be fitted by their expected Mendelian ratio, segregation distortion is declared to occur at the tested locus. In population genetics, distortion can be explained by selection due to the difference in fitness (denoted by w) between genotypes at the locus. Assume two genotypes AA and aa each have 100 individuals. All AA individuals can survive to the adult stage and can pass their genes to the next generation. For genotype aa, only 90 individuals can survive to the adult stage. Fitness is defined to be 1 for genotype AA, and 0.9 for genotype aa. In other words, genotype aa has a fitness of 0.9 relative to genotype AA. When more than two genotypes are considered at one locus, fitness is defined in comparison with the genotype having the highest survival rate, taking values between 0 and 1. In population genetics, 1 - w is called the selection coefficient, denoted by s (Wang, 2017; Falconer and Mackay, 1996).

Segregation distortion has been frequently observed in genetic populations. Due to linkage, distortion at one locus can also cause distortion at other linked loci or markers. Distortion from the expected Mendelian ratio changes genotypic frequencies, and therefore affects population structure and genetic variance. Intuitively, distortion will have effects on gene mapping (not always negative; see §10.5 for more information). But its effects on recombination frequency estimation are rather limited. The DH population is used below as an illustration. Assume the fitnesses are 1 and 1 - s for genotypes AA and aa, respectively, and 1 for both genotypes BB and bb. Genotypic frequencies before and after selection are given in table 2.17. Let r_D be the recombination frequency under distortion, which is also defined by the proportion of recombinant homozygous genotypes in the DH population, i.e., equation 2.32.


$$r_D = \frac{\frac{1}{2}r + \frac{1}{2}r(1-s)}{\frac{1}{2}(2-s)} = \frac{r(2-s)}{2-s} = r \tag{2.32}$$


The selection at the A/a locus reduces the frequencies of recombinant type aaBB and parental type aabb simultaneously but does not change the overall proportion of the two recombinant types AAbb and aaBB. In theory, r_D under distortion is identical to r under no distortion (equation 2.32).


TAB. 2.17 – Genotypic frequencies under selection, where s is the selection coefficient of genotype aa relative to genotype AA; no selection occurs in the other locus.

Genotype   Non-distorted frequencies   Selection coefficient   Frequencies after selection
AABB       (1/2)(1 - r)                0                       (1/2)(1 - r)
AAbb       (1/2)r                      0                       (1/2)r
aaBB       (1/2)r                      s                       (1/2)r(1 - s)
aabb       (1/2)(1 - r)                s                       (1/2)(1 - r)(1 - s)
Sum        1                                                   (1/2)(2 - s)


Using the genotypic numbers given in table 2.11, several levels of selection are considered at marker Act8A. Estimated recombination frequencies are shown in table 2.18. It can be seen that even when the selection coefficient of aa is equal to 1, the estimated value is still close to the estimate under no selection.


TAB. 2.18 – Re-estimation of recombination frequency between two markers Act8A and OP06, assuming the fitness values of genotypes AA and aa at marker Act8A are 1 and 1 - s, respectively, where s is called selection coefficient in population genetics.

Marker Act8A   Marker OP06                 s = 0     s = 0.5   s = 0.75   s = 1
AA             BB                          64        64        64         64
AA             bb                          8         8         8          8
aa             BB                          7         …         …          0
aa             bb                          61        …         …          0
Number of recombinants                     15        …         …          8
Total sample size                          140       …         …          72
Estimated recombination frequency          0.1071    …         …          0.1111
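Equation 2.32 and the two extreme cases of table 2.18 can be checked with a short sketch. The code assumes the DH frequencies of table 2.17 and the genotypic numbers 64, 8, 7, and 61 of table 2.11. The aa classes are scaled by the fitness 1 - s without rounding to whole individuals, which is an assumption of this sketch, and both function names are hypothetical.

```python
def r_after_selection(r, s):
    """Proportion of recombinant types in a DH population after selection (equation 2.32)."""
    recombinant = 0.5 * r + 0.5 * r * (1 - s)            # AAbb + aaBB
    parental = 0.5 * (1 - r) + 0.5 * (1 - r) * (1 - s)   # AABB + aabb
    return recombinant / (recombinant + parental)


def reestimate_r(counts, s):
    """Re-estimate r after scaling the aa classes by the fitness 1 - s.

    counts follow the order AABB, AAbb, aaBB, aabb.
    """
    n_aabb_upper, n_rec_aa_upper = counts[0], counts[1]  # AABB, AAbb (fitness 1)
    n_rec_aa_lower = counts[2] * (1 - s)                 # aaBB after selection
    n_par_aa_lower = counts[3] * (1 - s)                 # aabb after selection
    recombinant = n_rec_aa_upper + n_rec_aa_lower
    total = n_aabb_upper + n_rec_aa_upper + n_rec_aa_lower + n_par_aa_lower
    return recombinant / total


counts = (64, 8, 7, 61)   # table 2.11
for s in (0.0, 0.5, 0.75, 1.0):
    print(s, round(r_after_selection(0.1, s), 4), round(reestimate_r(counts, s), 4))
```

In the infinite-population formula the factor 2 - s cancels exactly, whereas in a finite sample the estimate shifts slightly because the AA and aa halves of the sample carry slightly different recombinant proportions.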
Equation 2.32 gives the theoretical result in an infinite population. In practice, when genetic populations are large enough, for example having 200 or more individuals or families, effects from less severe segregation distortion can probably be ignored without significantly affecting the accuracy of recombination frequency estimation and the subsequent linkage map construction. Even though similar estimates are observed in table 2.18, population size is reduced by the selection, which will reduce the accuracy of the estimate (i.e., the estimate will have a larger sampling variance). In addition, a large number of markers are included in one linkage group. Estimation errors in pair-wise recombination frequencies may accumulate and affect the quality of the genetic linkage map to some extent. In summary, distortion has limited effects on linkage analysis and map construction in large genetic populations, but its effect on gene mapping can sometimes be significant (see §10.5 for details). During the development of genetic populations, distortion should still be avoided as much as possible, in addition to the considerations which have been discussed in §1.1.3, chapter 1.


Exercises 


2.1 Assume two loci are linked on one chromosome with recombination frequency r, and the genotypes of the two homozygous parents are AABB and aabb. Work out the theoretical frequencies of the four homozygotes in the DH population derived from F2, i.e., F2-DH. Assume again that the observed numbers of F2-DH lines are 64, 8, 7, and 61 (same as table 2.11) for the four homozygotes AABB, AAbb, aaBB, and aabb, respectively. Work out the MLE of recombination frequency between the two loci.


2.2 Let n1-n9 represent the observed numbers of 9 genotypes in an F2 population, with theoretical frequencies given in table 2.5. The log-likelihood function is given below.

$$\ln L \propto (2n_1 + n_2 + n_4 + n_6 + n_8 + 2n_9)\ln(1-r) + n_5\ln(1 - 2r + 2r^2) + (n_2 + 2n_3 + n_4 + n_6 + 2n_7 + n_8)\ln r$$

Confirm the following first-order and second-order derivatives of the log-likelihood function.

$$\frac{d\ln L}{dr} = -\frac{2n_1 + n_2 + n_4 + n_6 + n_8 + 2n_9}{1-r} - \frac{n_5(2 - 4r)}{1 - 2r + 2r^2} + \frac{n_2 + 2n_3 + n_4 + n_6 + 2n_7 + n_8}{r}$$

$$\frac{d^2\ln L}{dr^2} = -\frac{2n_1 + n_2 + n_4 + n_6 + n_8 + 2n_9}{(1-r)^2} + \frac{n_5(8r - 8r^2)}{(1 - 2r + 2r^2)^2} - \frac{n_2 + 2n_3 + n_4 + n_6 + 2n_7 + n_8}{r^2}$$

2.3 In one F2 population in soybean derived from two homozygous cultivars, genotyping is conducted on 60 individuals and two parents. Let 2 and 0 represent the two parental genotypes, 1 the F1 genotype, and -1 a missing value. Given below are genotypic data at two co-dominant markers, i.e., *Satt521 and *Satt549. Assume that codes 2, 1, and 0 at *Satt521 represent genotypes AA, Aa, and aa, respectively; codes 2, 1, and 0 at *Satt549 represent genotypes BB, Bb, and bb, respectively.


Marker      Individuals 1-20
*Satt521    2 0 2 1 1 1 1 1 0 1 2 0 0 1 1 1 0 1 0 2
*Satt549    2 0 2 1 1 -1 1 0 0 0 2 0 0 1 1 1 0 1 1 2
            Individuals 21-40
*Satt521    1 2 1 1 1 0 1 2 0 1 2 0 0 0 1 0 2 0 0
*Satt549    1 2 1 1 1 0 1 2 0 1 2 0 0 0 1 2 0 1 0 -1
            Individuals 41-60
*Satt521    1 1 2 0 1 0 -1 2 1 1 1 1 2 0 1 0 0 2 1
*Satt549    2 1 0 0 1 0 -1 2 1 0 1 1 1 0 -1 0 0 2 1 1


(1) Work out the observed number of individuals having each of the nine genotypes 
at *Satt521 and *Satt549. 

(2) Work out the MLE of recombination frequency between *Satt521 and *Satt549 
by Newton algorithm from the nine genotypic numbers obtained in (1). 


2.4 Given below are sample sizes of nine genotypes observed at two co-dominant loci in an F2 population (same as exercise 2.3). Work out the MLE of recombination frequency between the two co-dominant loci by the EM algorithm.


Genotype AABB AABb AAbb AaBB AaBb Aabb aaBB aaBb aabb 
Sample size 10 2 1 1 21 3 0 1 17 


2.5 Observed numbers of the 6 identifiable genotypes given in table 2.13 are re-arranged in the following table. Column 4 gives the expected frequency of recombinant haploids in each identifiable genotype. The total population size is n = 2341.

Identifiable genotype   Observed number   Theoretical frequency   Expected frequency of recombinant haploids
AAB_                    n1 = 572          (1/4)(1 - r^2)          p1 = r/(1 + r)
AAbb                    n2 = 3            (1/4)r^2                p2 = 1
AaB_                    n3 = 1161         (1/2)(1 - r + r^2)      p3 = r(1 + r)/[2(1 - r + r^2)]
Aabb                    n4 = 22           (1/2)r(1 - r)           p4 = 1/2
aaB_                    n5 = 14           (1/4)r(2 - r)           p5 = 1/(2 - r)
aabb                    n6 = 569          (1/4)(1 - r)^2          p6 = 0
Take identifiable genotype AAB_ as an example to illustrate the calculation of column 4 in the table. It can be seen from table 2.3 that AAB_ is a mixture of two genotypes AABB and AABb, having expected frequencies $\frac{1}{4}(1-r)^2$ and $\frac{1}{2}r(1-r)$, respectively, with a total frequency $\frac{1}{4}(1-r^2)$ in an F2 population. Therefore, the expected number of AAB_ is equal to $\frac{1}{4}(1-r^2)n$, and the expected number of haploids is equal to $\frac{1}{2}(1-r^2)n$. Both haploids in AABB belong to the parental type. One haploid in AABb is the parental type, and the other one is the recombinant type. Therefore, the expected number of recombinant haploids is equal to $\frac{1}{2}r(1-r)n$. So the expected proportion of recombinant haploids among the $\frac{1}{2}(1-r^2)n$ haploids of individuals with genotype AAB_ is equal to $\frac{\frac{1}{2}r(1-r)n}{\frac{1}{2}(1-r^2)n} = \frac{r}{1+r}$, i.e., p1 in the table.

Based on the frequencies given in column 4, the total number of recombinant haploids in the population can be calculated, and the total number of haploids is 2n. Given an initial value r_0, the frequencies in column 4 can be acquired, based on which the recombination frequency can be updated by iteration. The procedure described above is actually the EM algorithm to calculate the MLE of recombination frequency between one co-dominant marker and one dominant marker in the F2 population.


(1) Work out the expected proportion of recombinant haploids in genotype AaB_. 

(2) Work out the expected proportion of recombinant haploids in genotype aaB_. 

(3) Given an initial value of 0.25, calculate the estimate of recombination frequency 
after 10 iterations by the EM algorithm. 

(4) Given an initial value of 0.10, calculate the estimate of recombination frequency 
after 10 iterations by the EM algorithm. 


2.6 In table 2.13, assume the molecular marker is also dominant. Observed numbers and expected frequencies of the four identifiable genotypes are given below.

Identifiable genotype   A_B_                    A_bb                    aaB_                    aabb
Sample size             n1 = 1733               n2 = 25                 n3 = 14                 n4 = 569
Expected frequency      1/2 + (1/4)(1 - r)^2    1/4 - (1/4)(1 - r)^2    1/4 - (1/4)(1 - r)^2    (1/4)(1 - r)^2


Let $\theta = (1-r)^2$, and let n1-n4 represent the four observed numbers. The log-likelihood of θ can be written as,

$$\ln L \propto n_1\ln(2+\theta) + (n_2+n_3)\ln(1-\theta) + n_4\ln\theta$$

Confirm that the MLE of θ is given by,

$$\hat{\theta} = \frac{-(2n - 3n_1 - n_4) + \sqrt{(2n - 3n_1 - n_4)^2 + 8n \times n_4}}{2n}$$
It is well known in statistics that if $\hat{\theta}$ is the MLE of θ, and g(θ) is a monotonic function of θ, then $g(\hat{\theta})$ is the MLE of g(θ). Based on this statement and the relationship $\theta = (1-r)^2$, work out the MLE of recombination frequency between the dominant marker and the dominant disease-resistant gene.


2.7 In exercise 2.6, assume the selection coefficient of genotype aa is equal to 0.5 
relative to genotype A_. Use the genotypic numbers after selection to re-calculate 
the MLE of recombination frequency. 


2.8 Accumulated recombination frequency after a number of generations of selfing was given in equation 2.7. Markov chain theory has to be applied to work out that relationship. Assume that the repeated selfing starts from the F1 generation with genotype AB/ab. Due to the equal allele frequencies at both loci, the two parental homozygotes AABB and aabb have equal frequencies, the two recombinant homozygotes AAbb and aaBB have equal frequencies, and the four genotypes AABb, aaBb, AaBB, and Aabb have equal frequencies in any generation of the repeated selfing. For convenience, genotypes with equal frequencies are merged into one type, or one state in the terminology of the Markov chain. The two linkage phases AB/ab and Ab/aB have different frequencies and have to be treated separately as types or states 4 and 5. Therefore, the 10 × 10 transition matrix given in equation 2.4 can be re-arranged into one 5 × 5 matrix, i.e.,

$$T = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ \frac{1}{4} & \frac{1}{4} & \frac{1}{2} & 0 & 0 \\ \frac{1}{2}(1-r)^2 & \frac{1}{2}r^2 & 2r(1-r) & \frac{1}{2}(1-r)^2 & \frac{1}{2}r^2 \\ \frac{1}{2}r^2 & \frac{1}{2}(1-r)^2 & 2r(1-r) & \frac{1}{2}r^2 & \frac{1}{2}(1-r)^2 \end{pmatrix}$$


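(1) The transition matrix can be partitioned as follows.

$$T = \begin{pmatrix} I_{2\times 2} & 0_{2\times 3} \\ R_{3\times 2} & Q_{3\times 3} \end{pmatrix}$$

Confirm the inverse matrix of (I - Q) as given below,

$$(I - Q)^{-1} = \begin{pmatrix} 2 & 0 & 0 \\ \dfrac{8r(1-r)}{1+2r-2r^2} & \dfrac{2(1+2r-r^2)}{(1+2r)(1+2r-2r^2)} & \dfrac{2r^2}{(1+2r)(1+2r-2r^2)} \\ \dfrac{8r(1-r)}{1+2r-2r^2} & \dfrac{2r^2}{(1+2r)(1+2r-2r^2)} & \dfrac{2(1+2r-r^2)}{(1+2r)(1+2r-2r^2)} \end{pmatrix}$$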




(2) During the repeated selfing, the change in frequencies of the five states can be modeled by the Markov chain. In Markov chain theory, types 1 and 2 are called absorbing states; i.e., once reached, individuals will remain in the state. Types 3, 4, and 5 are called transient states. In theory, it can be proved that the probability from transient state i + 2 (i = 1, 2, 3) to absorbing state j (j = 1, 2) after a long time (e.g., more than six generations of selfing) is equal to element (i, j) in matrix $(I - Q)^{-1}R$. Confirm $(I - Q)^{-1}R$ as given below,

$$(I - Q)^{-1}R = \begin{pmatrix} \dfrac{1}{2} & \dfrac{1}{2} \\ \dfrac{1}{1+2r} & \dfrac{2r}{1+2r} \\ \dfrac{2r}{1+2r} & \dfrac{1}{1+2r} \end{pmatrix}$$


(3) Genotype AB/ab belongs to state 4. After an infinite number of selfing generations, the probabilities of AB/ab entering the two absorbing states are elements (2, 1) and (2, 2) in matrix $(I - Q)^{-1}R$, i.e., $\frac{1}{1+2r}$ and $\frac{2r}{1+2r}$, respectively. State 2 represents the two recombinant homozygotes, and therefore the probability of AB/ab entering state 2 is in fact the accumulated recombination frequency, i.e., $R = \frac{2r}{1+2r}$.


2.9 Assume the 10 genotypes at two co-dominant loci with recombination frequency r can be observed in an F2 population, and the two parents have genotypes AABB and aabb. Observed sample sizes are represented by n51 and n52 for the two linkage phases AB/ab and Ab/aB, which have expected frequencies $\frac{1}{2}(1-r)^2$ and $\frac{1}{2}r^2$, respectively. Sample sizes and expected frequencies for the other genotypes are the same as given in table 2.15, and the total population size is n. Show that the MLE of r can be written explicitly as,

$$\hat{r} = \frac{n_2 + 2n_3 + n_4 + 2n_{52} + n_6 + 2n_7 + n_8}{2n}$$

where the numerator is in fact the number of recombinant haploids observed in the F2 population, and the denominator is the total number of haploids carried by the n individuals.


Chapter 3 


Three-Point Analysis and Linkage Map 
Construction 


The construction of genetic linkage maps is a classical and very important issue in the genetics of almost every species. Linkage maps provide the opportunity to visualize the relative positions of markers and genes, together with their genetic distances and linkage relationships on chromosomes in the considered species (Hartl and Jones, 2005; Bailey, 1961; Kempthorne, 1957). Distance between two loci on the linkage map is represented in centiMorgan (cM), which can be acquired from the estimation of recombination frequency. A map distance of 1 cM is equivalent to a recombination frequency of 1%. On linkage maps, the relative positions between genes for important phenotypic traits and molecular markers at the DNA sequence level can be clearly observed; so can the chromosomal locations of these genes, such as the short arm, long arm, centromere, and telomere of the chromosome. Linkage maps provide the underlying information for other genetic studies as well, e.g., gene mapping, gene fine-mapping, map-based cloning of genes, and marker-assisted selection in breeding. The first linkage map was constructed for the X chromosome in the fruit fly, using six morphological traits as markers (Sturtevant, 1913). Linkage maps constructed nowadays can easily have hundreds or even thousands of markers. Map construction can be roughly classified into two steps, i.e., marker grouping and marker ordering. More markers on the chromosome will increase the resolution of the constructed map. However, a large number of markers also causes problems for the construction algorithms: the running time becomes long, some linked markers may be wrongly grouped, and some closely linked markers may be wrongly ordered. In addition, errors in genotyping, even at a low rate, can cause other kinds of problems for construction algorithms. Therefore, high-quality and high-density genetic maps, together with their construction methods, remain active topics in genetics (Zhang et al., 2020; Mollinari et al., 2009; Hackett and Broadfoot, 2003; Mester et al., 2003; Buetow and Chakravarti, 1987; Lander and Green, 1987; Lander et al., 1987; Weeks and Lange, 1987; Haldane, 1919).
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8.1 Three-Point Analysis and Mapping Function 


3.1.1 Genetic Interference and Coefficient of Interference 


For three linked loci represented by $M_1$, $M_2$, and $M_3$, three pair-wise recombination
frequencies can be estimated in one genetic population by the theories and methods
introduced in chapter 2. Let $r_{12}$, $r_{23}$, and $r_{13}$ be recombination frequencies between
$M_1$ and $M_2$, between $M_2$ and $M_3$, and between $M_1$ and $M_3$, respectively. The relative
order of the three loci can be roughly determined from the estimates of three
recombination frequencies. For example, if the estimate of $r_{13}$ is larger than both
estimates of $r_{12}$ and $r_{23}$, the order of the three loci would be $M_1$-$M_2$-$M_3$, i.e., locus
$M_2$ is located between the other two loci.

Assume the order of three loci is given by $M_1$-$M_2$-$M_3$, and the interference on
crossing-over does not occur between chromosome intervals denoted by $M_1$-$M_2$ and
$M_2$-$M_3$, i.e., crossing-over happens independently in the two intervals. In this case,
three pair-wise recombination frequencies have the relationship given in
equation 3.1. The left side in equation 3.1 can be viewed as the probability that no
recombination occurs between $M_1$ and $M_3$. The right side can be viewed as the sum
probability of two exclusive events: (1) no crossing-over happens between $M_1$ and $M_2$
and no crossing-over happens between $M_2$ and $M_3$; (2) one crossing-over happens
between $M_1$ and $M_2$ and the other one happens between $M_2$ and $M_3$, i.e., two
crossing-over events happen between $M_1$ and $M_3$. In both events the alleles at $M_1$
and $M_3$ are not recombined, so the sum of the two probabilities is equal to $1 - r_{13}$,
resulting in equation 3.1.


$$(1 - r_{13}) = (1 - r_{12})(1 - r_{23}) + r_{12} r_{23} \tag{3.1}$$


Equation 3.2 can be immediately acquired from equation 3.1. In fact, the middle
term in equation 3.2 can be also viewed as the sum probability of two exclusive events:
(1) one crossing-over happens between $M_1$ and $M_2$, but no crossing-over happens
between $M_2$ and $M_3$; (2) no crossing-over happens between $M_1$ and $M_2$, but one
crossing-over happens between $M_2$ and $M_3$. The sum of the two exclusive events is
identical to one crossing-over event that happened between $M_1$ and $M_3$. Obviously,
recombination frequency is not additive under the assumption of independent
crossing-overs.


$$r_{13} = r_{12}(1 - r_{23}) + (1 - r_{12}) r_{23} = r_{12} + r_{23} - 2 r_{12} r_{23} \tag{3.2}$$


When complete interference occurs, i.e., a crossing-over that happened in one
interval excludes any crossing-over in the other interval, the three
pair-wise recombination frequencies have the relationship in equation 3.3. The left
side in equation 3.3 can be viewed as the probability that one crossing-over happens
between $M_1$ and $M_3$. The right side can be viewed as the sum probability of two
exclusive events: (1) one crossing-over happens between $M_1$ and $M_2$; (2) one
crossing-over happens between $M_2$ and $M_3$. Obviously, recombination frequency is
additive under the assumption of complete interference.

$$r_{13} = r_{12} + r_{23} \tag{3.3}$$


In general situations, crossing-over between paired homologous chromosomes during
meiosis falls somewhere between complete independence and complete interference.
Let $\delta$ represent the coefficient of interference, that is,


$$r_{13} = r_{12} + r_{23} - 2(1 - \delta) r_{12} r_{23} \tag{3.4}$$


It can be seen from equation 3.4 that when $\delta = 0$, equation 3.4 becomes
identical to equation 3.2. Therefore, $\delta = 0$ represents the crossing-over that
happens independently in two intervals. When $\delta = 1$, equation 3.4 becomes
identical to equation 3.3. Therefore, $\delta = 1$ represents the complete interference.
When the order of markers is confirmed to be $M_1$-$M_2$-$M_3$, the coefficient of
interference can be estimated by equation 3.5 from the three pair-wise
recombination frequencies.


$$\delta = 1 - \frac{r_{12} + r_{23} - r_{13}}{2 r_{12} r_{23}} \tag{3.5}$$

Interference is an important genetic phenomenon of crossing over between
homologous chromosomes when they are paired during meiosis. The coefficient of
interference normally takes a value between 0 and 1. In other words, the frequency
of double crossing-overs observed in interval $M_1$-$M_3$ is lower than the expected
frequency $r_{12} r_{23}$ under independence. In some special situations, coefficients
above one or even negative have been observed. Coefficients of interference take
different values among different species, among different chromosomes in one species,
and in different regions of one chromosome (Lam et al., 2005; Broman et al., 2002;
Copenhaver et al., 2002; Kosambi, 1944). The accuracy of the estimate of
recombination frequency depends on a number of factors, e.g., type of the
population, size of the population, marker category, distortion, and missing values (see
§3.3 for some factors). In one simulated population as used in exercise 3.3,
independent crossing-over is assumed. It can be seen that the estimated coefficients of
interference in some regions deviate considerably from 0. The accurate estimation of
the coefficient of interference is based on the accurate estimation of recombination
frequencies in suitable populations.

Table 3.1 gives the pair-wise estimates of recombination frequency for 14 markers
in the barley DH population. Take the first three markers as an example. The estimate of
the pair-wise recombination frequency was 0.107 between Act8A and OP06, 0.076
between OP06 and aHor2, and 0.111 between Act8A and aHor2. From the three
estimates, OP06 should be ordered between Act8A and aHor2. From equation 3.5,
the coefficient of interference is estimated at $\delta = -3.422$, indicating that a negative
interference may occur between interval Act8A-OP06 and interval OP06-aHor2, i.e.,
the frequency of double crossing-over between Act8A and aHor2 may be higher than
the expected frequency under independence. Take markers 5-7 as another example.
The estimate of pair-wise recombination frequency was 0.184 between ABG464 and
Dor3, 0.036 between Dor3 and iPgd2, and 0.214 between ABG464 and iPgd2.
Dor3 should be ordered between ABG464 and iPgd2. From equation 3.5, the
coefficient of interference is estimated at $\delta = 0.617$, indicating that a positive interference
may occur between interval ABG464-Dor3 and interval Dor3-iPgd2, i.e., the frequency
of double crossing-over between ABG464 and iPgd2 may be lower than the expected
frequency under independence.


TAB. 3.1 – Pair-wise estimates of recombination frequencies for 14 markers in the barley DH population.

Marker    Act8A    OP06     aHor2    MWG943   ABG464   Dor3     iPgd2    cMWG733A AtpbA    drun8    ABC261   ABG710B  Aga7
OP06      0.107
aHor2     0.111    0.076
MWG943    0.419    0.429    0.419
ABG464    0.475    0.485    0.458    0.128
Dor3      0.457    0.460    0.459    0.308    0.184
iPgd2     0.438    0.468    0.419    0.321    0.214    0.036
cMWG733A  0.451    0.482    0.448    0.370    0.283    0.101    0.070
AtpbA     0.437    0.482    0.455    0.390    0.304    0.122    0.105    0.036
drun8     0.500    0.532    0.529    0.467    0.436    0.262    0.241    0.175    0.133
ABC261    0.483    0.507    0.511    0.441    0.410    0.236    0.222    0.155    0.113    0.049
ABG710B   0.493    0.525    0.530    0.496    0.475    0.317    0.294    0.227    0.184    0.105    0.070
Aga7      0.479    0.504    0.515    0.504    0.500    0.355    0.331    0.266    0.224    0.145    0.111    0.035
MWG912    0.464    0.489    0.481    0.504    0.529    0.400    0.376    0.317    0.273    0.192    0.171    0.094    0.057
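The three-point calculation used in the two barley examples can be sketched in a few lines of Python. The function names `middle_marker` and `interference_coefficient` are hypothetical and chosen only for this illustration. The inputs are the rounded estimates quoted in the text, so the results (about -3.43 and 0.55) are close to, but not identical with, the values reported above, which were presumably computed from unrounded estimates. The second example is particularly sensitive to rounding because the product $r_{12} r_{23}$ in the denominator of equation 3.5 is small.

```python
def middle_marker(r_ab, r_bc, r_ac):
    """Return which locus ('A', 'B', or 'C') lies between the other two.

    The pair with the largest recombination frequency is taken as the two
    outer loci, so the locus not in that pair is the middle one.
    """
    candidates = (("C", r_ab), ("A", r_bc), ("B", r_ac))
    return max(candidates, key=lambda item: item[1])[0]


def interference_coefficient(r12, r23, r13):
    """Equation 3.5, assuming the order M1-M2-M3 has already been confirmed."""
    return 1.0 - (r12 + r23 - r13) / (2.0 * r12 * r23)


# Rounded estimates quoted in the text (table 3.1).
# A = Act8A, B = OP06, C = aHor2: the largest estimate is between A and C.
print(middle_marker(r_ab=0.107, r_bc=0.076, r_ac=0.111))

delta_1 = interference_coefficient(0.107, 0.076, 0.111)  # Act8A-OP06-aHor2
delta_2 = interference_coefficient(0.184, 0.036, 0.214)  # ABG464-Dor3-iPgd2
print(round(delta_1, 2), round(delta_2, 2))
```

A negative value corresponds to more double crossing-overs than expected under independence, and a value between 0 and 1 to fewer.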


3.1.2 Mapping Function 


As shown previously, recombination frequency is not additive in most situations
(equation 3.4). However, in practice it is more convenient to require the genetic
distance to be additive. Assume three loci have the order $M_1$-$M_2$-$M_3$ on one chromosome,
and $m_{12}$, $m_{23}$, and $m_{13}$ are genetic distances between $M_1$ and $M_2$, between $M_2$ and $M_3$,
and between $M_1$ and $M_3$, respectively. Equation 3.6 gives the additive relationship of
the three distances, which are also called genetic linkage map distances, or map
distances in short. The unit of map distance, i.e., Morgan (M), is named after Thomas H.
Morgan, a famous geneticist. However, centi-Morgan (cM) is used in most cases, where
1 M = 100 cM.


$$m_{13} = m_{12} + m_{23} \tag{3.6}$$


Recombination frequency is first estimated by linkage analysis. Map distance
is a function of recombination frequency, i.e., $m = f(r)$, which converts the
non-additive recombination frequency into the additive distance and is therefore
called the mapping function. For closely linked loci, recombination frequency
$r = 0.01$ corresponds to a map distance of 1 cM. In most species, each chromosome
has a length from tens to hundreds of cM. Several mapping functions have been proposed
in order to calculate the additive map distance. These functions are based on different
assumptions regarding the degree of interference. Three of them will be introduced
below.


1. Morgan mapping function 


Morgan mapping function takes the percentage in recombination frequency as
map distance, i.e., $m = f(r) = 100 \times r$ in unit cM. For two neighboring intervals,
the total length is equal to the sum length of the two intervals. For example, for one
chromosomal interval defined by three ordered markers, i.e., $M_1$-$M_2$-$M_3$, the
recombination frequency between $M_1$ and $M_2$ is 0.02, and the map distance is 2 cM;
the recombination frequency between $M_2$ and $M_3$ is 0.03, and the map distance is 3 cM.
Based on the Morgan mapping function, the map distance between $M_1$ and $M_3$ is
equal to 5 cM, or the length of interval $M_1$-$M_3$ is 5 cM. Morgan mapping function
does not take double crossing-over into consideration, i.e., it assumes that the
interference is complete ($\delta = 1$). In fact, two or more crossing-overs can happen when
the chromosomal region is long, making the recombination frequency
non-additive. Therefore, this function cannot be used for large chromosomal
intervals.


2. Haldane mapping function


Haldane mapping function is built on the assumption of no interference 
(Haldane, 1919), i.e., equation 3.7, where the unit of map distance m is M. 


$$m = f(r) = -\frac{1}{2}\ln(1 - 2r) \quad \text{or} \quad r = \frac{1}{2}\left(1 - e^{-2m}\right) \tag{3.7}$$


For three ordered markers, i.e., $M_1$-$M_2$-$M_3$, when crossing-over happens
independently in interval $M_1$-$M_2$ and interval $M_2$-$M_3$, the relationship of three pair-wise
recombination frequencies given in equation 3.2 can be re-arranged as
$1 - 2r_{13} = (1 - 2r_{12})(1 - 2r_{23})$ (see exercise 3.2). The distance thus acquired by
equation 3.7 becomes additive. If unit cM is used, the Haldane mapping function can be
expressed by equation 3.8.


$$m = f(r) = -50\ln(1 - 2r) \quad \text{or} \quad r = \frac{1}{2}\left(1 - e^{-m/50}\right) \tag{3.8}$$
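For readers who want to see why the Haldane distance is additive, the re-arrangement of equation 3.2 can be written out as follows. This is only a short worked sketch using the symbols already defined in the text, with $m$ in unit M as in equation 3.7.

```latex
\begin{align*}
1 - 2r_{13} &= 1 - 2\,(r_{12} + r_{23} - 2 r_{12} r_{23}) \\
            &= 1 - 2r_{12} - 2r_{23} + 4 r_{12} r_{23} \\
            &= (1 - 2r_{12})(1 - 2r_{23}), \\[4pt]
m_{13} = -\tfrac{1}{2}\ln(1 - 2r_{13})
       &= -\tfrac{1}{2}\ln(1 - 2r_{12}) - \tfrac{1}{2}\ln(1 - 2r_{23})
        = m_{12} + m_{23}.
\end{align*}
```

Taking the logarithm turns the product into a sum, which is the additive relationship required by equation 3.6.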


3. Kosambi mapping function 


Kosambi mapping function takes interference into consideration, i.e., interference
is stronger in shorter chromosomal intervals but weaker in longer intervals
(Kosambi, 1944). Assuming that the relationship between the coefficient of interference
and recombination frequency is given by $\delta = 1 - 2r$, the Kosambi mapping function
is given in equation 3.9, where the unit of map distance $m$ is M.


$$m = f(r) = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r} \quad \text{or} \quad r = \frac{1}{2} \cdot \frac{e^{4m} - 1}{e^{4m} + 1} \tag{3.9}$$


As an exercise for readers, equation 3.4 can be re-arranged into
$\frac{1 + 2r_{13}}{1 - 2r_{13}} = \frac{1 + 2r_{12}}{1 - 2r_{12}} \times \frac{1 + 2r_{23}}{1 - 2r_{23}}$,
assuming $\delta = 1 - 2r$ (see exercise 3.2). Therefore, the distance
acquired by equation 3.9 is additive. If unit cM is used, the Kosambi mapping
function can be expressed by equation 3.10.

$$m = f(r) = 25\ln\frac{1 + 2r}{1 - 2r} \quad \text{or} \quad r = \frac{1}{2} \cdot \frac{e^{m/25} - 1}{e^{m/25} + 1} \tag{3.10}$$


For the three mapping functions mentioned above, Haldane and Kosambi
functions are more commonly used. Given the recombination frequency, the map
distance is the largest from the Haldane function and the smallest from the Morgan
function (figure 3.1). When recombination frequency $r < 0.05$, the three functions
give similar distances. Therefore, for high-density linkage maps, similar map distances
by interval and similar total map length can be acquired, regardless of the mapping
function.
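The three mapping functions in unit cM (Morgan, equation 3.8, and equation 3.10) can be compared numerically with a short Python sketch. The function names are hypothetical and chosen only for this illustration. The Kosambi additivity check assumes that equation 3.4 is applied with $\delta = 1 - 2r_{13}$, which re-arranges to $r_{13} = (r_{12} + r_{23})/(1 + 4 r_{12} r_{23})$; that reading of $\delta = 1 - 2r$ is an assumption of this sketch.

```python
import math


def morgan_cm(r):
    """Morgan function: map distance (cM) is 100 times r."""
    return 100.0 * r


def haldane_cm(r):
    """Equation 3.8 (no interference); valid for 0 <= r < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Equation 3.10 (interference weakens as r grows); valid for 0 <= r < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def haldane_r(m_cm):
    """Inverse of equation 3.8: recombination frequency from distance in cM."""
    return 0.5 * (1.0 - math.exp(-m_cm / 50.0))


def kosambi_r(m_cm):
    """Inverse of equation 3.10: recombination frequency from distance in cM."""
    e = math.exp(m_cm / 25.0)
    return 0.5 * (e - 1.0) / (e + 1.0)


for r in (0.01, 0.05, 0.20, 0.40):
    print(f"r={r:.2f}  Morgan={morgan_cm(r):6.2f}  "
          f"Kosambi={kosambi_cm(r):6.2f}  Haldane={haldane_cm(r):6.2f}")

# Additivity check for two adjacent intervals with r12 = 0.10 and r23 = 0.20.
r12, r23 = 0.10, 0.20
r13_independent = r12 + r23 - 2.0 * r12 * r23        # equation 3.2
r13_kosambi = (r12 + r23) / (1.0 + 4.0 * r12 * r23)  # equation 3.4, delta = 1 - 2*r13
print(math.isclose(haldane_cm(r13_independent), haldane_cm(r12) + haldane_cm(r23)))
print(math.isclose(kosambi_cm(r13_kosambi), kosambi_cm(r12) + kosambi_cm(r23)))
```

For small $r$ the three columns nearly coincide, and they diverge as $r$ approaches 0.5, where the Haldane and Kosambi distances grow without bound.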


FIG. 3.1 – Comparison of three mapping functions (Morgan, Haldane, and Kosambi). Horizontal
axis: recombination frequency; vertical axis: map distance.


3.2 Construction of Genetic Linkage Maps


Two major steps, i.e., grouping and ordering, are included in linkage map
construction. By grouping, the markers which are linked with each other on one
chromosome are classified into one linkage group. When the polymorphic markers used in
genotyping cover the whole genome, the ideal number of groups should be
equal to the number of chromosomes in the studied species. By ordering, the relative
order is determined for all markers classified into one linkage group. The ideal order
of markers should be exactly the same as their physical order on the parental
genome. The estimate of recombination frequency and the associated LOD score as
introduced in chapter 2, and the map distance as introduced previously, reflect the linkage
relationship between markers. All three parameters can be used in map
construction.


3.2.1 Marker Grouping Algorithm 


Grouping is the first step in map construction. Without an accurate grouping, it is
hard to imagine that a high-quality linkage map could be constructed. An ideal
grouping should acquire a number of linkage groups equal to the number of
chromosomes. Each group should include all polymorphic markers located
on the corresponding chromosome. Linkage relationship between markers can be
seen not only from the test statistic LOD score but also from the estimate of
recombination frequency and the converted map distance. Take the LOD score as an
example to illustrate the grouping procedure. Given one threshold value on LOD
score (between 2.5 and 3.0 in most cases), the $n$ markers to be grouped are
treated as one un-grouped set at first, which is represented by $G_0 = \{M_1, M_2, ..., M_n\}$,
where $n$ is greater than 2. The $k$ groups at the end of the grouping
procedure are represented by $k$ non-empty sets, i.e., $G_1, ..., G_k$.

With the development of both genetic and physical maps, more and more molecular
markers have known positions on physical maps or linkage maps previously
constructed. Markers with known linkage information ahead of grouping are called
anchors. Utilizing anchor information which has been confirmed in previous studies
can greatly improve the marker grouping algorithm in newly-developed populations.
Therefore, two scenarios will be considered in the grouping, i.e., $k = 0$, representing
that no anchor information can be used, and $k > 0$, representing that $k$ anchor
groups are already known from other sources.


Scenario 1: k = 0, i.e., no anchor information can be used 


(1.1) Identify the pair of markers in set $G_0$ to be grouped first. The two markers,
represented by $M_{j_1}$ and $M_{j_2}$, have the closest linkage relationship. The LOD score
between the two markers is denoted by $D_{j_1 j_2}$ and defined by equation 3.11.

$$D_{j_1 j_2} = \max\{\mathrm{LOD}(M_{i_1}, M_{i_2});\ i_1, i_2 = 1, 2, ..., n_0;\ i_1 \neq i_2\} \tag{3.11}$$

where $\mathrm{LOD}(M_{i_1}, M_{i_2})$ is the LOD score between markers $M_{i_1}$ and $M_{i_2}$, and $n_0$ is the
number of markers in set $G_0$.

(1.2) If $D_{j_1 j_2}$ is greater than the LOD score threshold, the first set, denoted by $G_1$, is
generated to contain the two markers $M_{j_1}$ and $M_{j_2}$. Otherwise, two sets, denoted by
$G_1$ and $G_2$, are generated to contain each of the two markers.

(1.3) Remove the two markers $M_{j_1}$ and $M_{j_2}$ from set $G_0$.


After the three steps described above, either one set having two markers or two 
sets each having one marker will be acquired. Taking the acquired set(s) as anchor 
group(s), the grouping procedure continues as Scenario 2. 


Scenario 2: k > 0, i.e., a number of k anchor groups are known before grouping. 


(2.1) Identify the marker in set $G_0$, represented by $M_{i^*}$, which has the highest priority
to be grouped. The method to identify such a marker is given below. For any marker
$M_i$ in $G_0$, the largest LOD score, denoted by $C_i$, is worked out first by
equation 3.12.

$$C_i = \max\{\mathrm{LOD}(M_i, M_{xy});\ x = 1, 2, ..., k;\ y = 1, 2, ..., n_x\} \tag{3.12}$$

where $M_{xy}$ is the $y$th marker in the $x$th group, and $n_x$ is the number of markers in
group $G_x$. The marker with the highest priority to be grouped is the one having the
largest value in $C_i$ ($i = 1, 2, ..., n_0$), i.e.,

$$C_{i^*} = \max\{C_i;\ i = 1, 2, ..., n_0\}$$

where $n_0$ is the number of markers in set $G_0$.

(2.2) Identify the group, represented by $G_{x^*}$, to contain $M_{i^*}$. The method to identify
such a group is given below. For any existing group $G_x$ ($x = 1, 2, ..., k$), the largest
LOD score, denoted by $D_x$, is worked out first by equation 3.13.

$$D_x = \max\{\mathrm{LOD}(M_{i^*}, M_{xy});\ y = 1, 2, ..., n_x\} \tag{3.13}$$

where $M_{xy}$ is the $y$th marker in group $G_x$, and $n_x$ is the number of markers in group $G_x$.
The group with the highest priority to contain marker $M_{i^*}$ is the one having the largest
value in $D_x$ ($x = 1, 2, ..., k$), i.e.,

$$D_{x^*} = \max\{D_x;\ x = 1, 2, ..., k\}$$

where $k$ is the number of the existing groups.

(2.3) Determine whether marker $M_{i^*}$ should be included in group $G_{x^*}$. If $D_{x^*}$ is greater
than the LOD score threshold, marker $M_{i^*}$ is classified into group $G_{x^*}$; otherwise, a
new group, i.e., $G_{k+1}$, is generated and marker $M_{i^*}$ is classified into group $G_{k+1}$.

(2.4) Remove marker $M_{i^*}$ from set $G_0$.

(2.5) Repeat from step 2.1 until $G_0 = \emptyset$.


Sets $G_1, G_2, ..., G_k$ thus acquired at the end of the procedure are the grouping
results of the $n$ markers, and the number of sets is the number of marker groups. If the
estimate of recombination frequency or map distance is used as the grouping
criterion, maximization in equations 3.11-3.13 should be replaced with minimization;
'greater than' in step 1.2 and step 2.3 should be replaced with 'smaller than'. The
threshold is normally set around 0.3 for recombination frequency, or between 30 and
50 cM for map distance.

Obviously, the number of finally acquired groups, which depends largely on
the LOD score threshold, is unknown before grouping. A lower threshold value
results in fewer groups; a higher threshold value results in more groups. As said
earlier, the ideal number of groups should be equal to the number of chromosomes,
which is known for almost all species. In practice, users can adjust the threshold
value to acquire a desired number of groups. In addition, to avoid the uncertainty
in marker grouping, another grouping algorithm based on cluster analysis is
provided in some integrated map construction and gene mapping software
packages, e.g., QTL IciMapping (Meng et al., 2015), GACD (Zhang et al., 2015c) and
GAPL (Zhang et al., 2019). By this algorithm, the number of groups can be
specified before grouping.

For the 14 markers in the barley DH population as shown in figure 1.7, estimates 
of pair-wise recombination frequencies have been given in table 3.1. The upper 
triangular in table 3.2 gives the pair-wise LOD score, and the lower triangular gives 
the pair-wise map distance. The Haldane mapping function is used to convert the 
recombination frequency to map distance. For clarity, map distance 1000 cM is 
given when two markers have an estimate equal to or even greater than 0.5 in 
recombination frequency. When the LOD score threshold is set at 3 and no anchor 
information can be used, the 14 markers are classified into two groups. The first 
three make one group, and the other 11 make another group. When two markers, 
e.g., Mı and Muz, are known to be linked and used are anchors, one group is formed 
when the same threshold is used. 
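The grouping procedure of Scenarios 1 and 2 can be sketched in Python as below. This is a minimal illustration rather than the implementation used in any of the cited software packages; the names `UPPER_LOD`, `symmetric`, and `group_markers` are hypothetical. The LOD scores are copied from the upper triangle of table 3.2 (markers are indexed from 0, so index 0 is $M_1$), and the anchor pair ($M_1$, $M_{14}$) in the last call is chosen purely for illustration.

```python
from itertools import combinations

# Upper-triangular LOD scores among markers M1..M14, copied from table 3.2.
UPPER_LOD = [
    [21.4, 20.2, 0.8, 0.1, 0.2, 0.5, 0.3, 0.5, 0.0, 0.0, 0.0, 0.1, 0.2],
    [24.4, 0.6, 0.0, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.7, 0.2, 0.2, 0.8, 0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [18.0, 4.4, 3.9, 2.0, 1.4, 0.1, 0.4, 0.0, 0.0, 0.0],
    [12.8, 10.6, 5.9, 4.7, 0.5, 1.0, 0.1, 0.0, 0.0],
    [33.1, 22.1, 19.4, 7.2, 8.9, 4.2, 2.6, 1.2],
    [27.3, 22.2, 8.8, 10.2, 5.4, 3.7, 1.9],
    [33.1, 14.3, 16.2, 9.6, 7.1, 4.2],
    [18.7, 21.0, 13.2, 10.0, 6.4],
    [31.2, 22.2, 17.6, 12.5],
    [27.0, 21.5, 14.3],
    [33.6, 23.1],
    [29.1],
]


def symmetric(upper):
    """Expand upper-triangular rows into a full symmetric matrix."""
    n = len(upper) + 1
    mat = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper):
        for j, value in enumerate(row, start=i + 1):
            mat[i][j] = mat[j][i] = value
    return mat


def group_markers(lod, threshold=3.0, anchors=None):
    """Greedy LOD-based grouping; anchors is an optional list of known groups."""
    groups = [list(g) for g in (anchors or [])]
    ungrouped = set(range(len(lod))) - {m for g in groups for m in g}
    if not groups:
        # Scenario 1: seed with the most tightly linked pair (steps 1.1-1.3).
        i, j = max(combinations(sorted(ungrouped), 2),
                   key=lambda pair: lod[pair[0]][pair[1]])
        groups = [[i, j]] if lod[i][j] > threshold else [[i], [j]]
        ungrouped -= {i, j}
    while ungrouped:
        # Step 2.1: marker with the largest LOD to any already grouped marker.
        def closest(m):
            return max(lod[m][g] for grp in groups for g in grp)
        m = max(sorted(ungrouped), key=closest)
        # Step 2.2: group holding the most tightly linked partner of m.
        scores = [max(lod[m][g] for g in grp) for grp in groups]
        x = scores.index(max(scores))
        # Step 2.3: join that group, or open a new one.
        if scores[x] > threshold:
            groups[x].append(m)
        else:
            groups.append([m])
        ungrouped.remove(m)  # step 2.4
    return groups


LOD = symmetric(UPPER_LOD)
for grp in group_markers(LOD, threshold=3.0):
    print([f"M{m + 1}" for m in sorted(grp)])
print(len(group_markers(LOD, threshold=3.0, anchors=[[0, 13]])))
```

With a threshold of 3 and no anchors, the sketch separates the first three markers from the other 11, in agreement with the grouping described above. To group by recombination frequency or map distance instead, the maxima would be replaced with minima and the comparison reversed.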


TAB. 3.2 – Pair-wise LOD score (upper triangular elements) and map distance (cM) (lower triangular elements) for 14 markers in the barley
DH population.

Marker  M1     M2     M3     M4     M5     M6     M7     M8     M9     M10    M11    M12    M13    M14
M1      -      21.4   20.2   0.8    0.1    0.2    0.5    0.3    0.5    0.0    0.0    0.0    0.1    0.2
M2      10.9   -      24.4   0.6    0.0    0.2    0.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0
M3      11.3   7.6    -      0.7    0.2    0.2    0.8    0.3    0.2    0.0    0.0    0.0    0.0    0.0
M4      60.8   64.1   60.6   -      18.0   4.4    3.9    2.0    1.4    0.1    0.4    0.0    0.0    0.0
M5      91.4   105.1  78.2   13.1   -      12.8   10.6   5.9    4.7    0.5    1.0    0.1    0.0    0.0
M6      77.7   79.4   78.6   36.0   19.3   -      33.1   22.1   19.4   7.2    8.9    4.2    2.6    1.2
M7      67.7   85.3   60.8   38.1   22.9   3.6    -      27.3   22.2   8.8    10.2   5.4    3.7    1.9
M8      74.0   100.0  72.5   47.6   32.0   10.2   7.0    -      33.1   14.3   16.2   9.6    7.1    4.2
M9      67.3   100.0  76.5   52.2   35.3   12.5   10.6   3.6    -      18.7   21.0   13.2   10.0   6.4
M10     1000   1000   1000   84.6   66.9   29.1   26.3   18.3   13.6   -      31.2   22.2   17.6   12.5
M11     100.7  1000   1000   69.3   57.9   25.6   23.9   16.0   11.5   4.9    -      27.0   21.5   14.3
M12     123.7  1000   1000   139.9  91.4   37.3   33.7   24.5   19.4   10.6   7.1    -      33.6   23.1
M13     96.3   1000   1000   1000   1000   44.3   39.8   29.6   24.1   14.9   11.3   3.5    -      29.1
M14     82.4   112.6  98.9   1000   1000   54.9   48.9   37.3   30.7   20.2   17.9   9.5    5.7    -


Notes: Haldane mapping function is used to acquire the pair-wise map distance from the estimated recombination frequency given in table 3.1.
When the estimated value is equal to or even greater than 0.5, map distance cannot be calculated, which is represented by 1000 in the table.
M1 to M14 represent the 14 markers in table 3.1 by order.


3.2.2 Marker Ordering Algorithm


One major objective in ordering is to identify the most suitable order which has the 
shortest map length for a large number of linked markers. Development and 
advances in biotechnology have led to the availability of high-throughput molecular 
markers, which allow the construction of high-density genetic maps. However, a 
large number of molecular markers also requires highly efficient algorithms for 
linkage map construction. To date, there are several approximate algorithms 
available, including seriation (Buetow and Chakravarti, 1987), maximum-likelihood 
multi-locus linkage maps (Lander and Green, 1987), evolutionary strategy algorithm 
(Mester et al., 2003), uni-directional growth (Tan and Fu, 2006), minimum spanning 
tree of a graph as implemented in software MSTMap (Wu et al., 2008), and local 
changes based on a greedy initial route as implemented in software Lep-MAP 
(Rastas et al., 2013). However, some of the algorithms have problems with high 
computing complexity and low ordering accuracy. Introduced below are algorithms 
for solving the travelling salesman problems (TSP) with applications in linkage map 
construction. 

Given the number of n cities and the distances between any two of them, a 
salesman is required to visit each city once and only once, starting from any city and 
returning to the original place of departure. Which route should he choose in order 
to minimize the total travelling distance? This problem is referred to as TSP, one of 
the most challenging and widely studied optimization problems in mathematics 
(Laporte, 1992; Lin and Kernighan, 1973; Christofides and Eilon, 1972; Lin, 1965). 
TSP is a classical problem in combinatorial mathematics that is classified as 
non-deterministic polynomial (NP) hard. Theoretically, the best route of a TSP can 
be found through the comparison of all possible solutions. However, this quickly
becomes impossible as the number of cities increases. For example, for
$n = 50$ cities, the number of possible routes is $\frac{1}{2} n! = 1.52 \times 10^{64}$.
Computing time of the exact algorithms increases either exponentially or according to
very high-order monomial functions.
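The size of this search space is easy to confirm with exact integer arithmetic. The short sketch below counts $\frac{1}{2} n!$ routes, i.e., every linear order of the $n$ cities counted once regardless of the direction in which it is read.

```python
import math

n = 50
# Each order of the n cities is counted once, whichever direction it is read in.
routes = math.factorial(n) // 2
print(f"{routes:.2e}")
```

Even at a billion routes evaluated per second, exhaustive comparison of this many routes is far beyond reach, which motivates the heuristic procedures introduced next.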

Therefore, various heuristic (or approximate) procedures have been developed in 
order to solve TSP with a large number of cities (i.e., several hundred to thousands 
of cities), which could produce answers close to the optimal solution. The best one 
among these approximate procedures is the k-optimal algorithm (Laporte, 1992; 
Lin and Kernighan, 1973). Two steps are involved in the k-optimal algorithm, i.e., 
route construction and route improvement, which will be introduced below in detail. 


1. Construction of the initial routes 


An initial route is needed for the second step of route improvement. A number of
algorithms can be used to construct the initial route, among which the nearest
neighbor (NN) algorithm (also called the greedy algorithm) is commonly used
and will be introduced here. The NN algorithm starts from a route containing the two
markers with the shortest distance in one marker group. One is treated as the head of
the route, and the other one is treated as the tail of the route. Then, one of the
remaining markers, which has the shortest distance to either head or tail of the
existing route, is identified and attached either before the head or after the tail to
form a longer route. This procedure continues until all markers in the group are
included in the route. There is no guarantee that the length of the NN initial route
would be shorter than any initial routes from other algorithms. There is no
guarantee either that one initial route with a shorter length would result in a shorter
route after the second step of improvements. In practice, the NN algorithm can also
start from each marker in the group, and construct $n$ initial routes from
$n$ markers. Then the shortest route is identified and used as the initial route of the
second step. One other option is to use all $n$ initial routes in the second step and
identify the shortest route from the $n$ improved routes. Of course, the latter option
will take much longer computing time.

Assume there are $n$ markers in one linkage group, represented by set
$G = \{M_1, M_2, ..., M_n\}$. Three steps are included in constructing the initial route.


1) For any marker $M_i$ ($i = 1, 2, ..., n$) in set $G$, $M_i$ is treated both as the start and
   end points of the route with only one marker, and then $M_i$ is removed from set
   $G_0$, where $G_0 = G$ at the beginning of route construction.

2) One marker in $G_0$ is found, which has the shortest distance from the endpoint of
   the existing route. This marker is attached to the existing route as the new
   endpoint and then is removed from set $G_0$.

3) If $G_0 = \emptyset$, one initial route is completed and the program repeats from step
   (1) for the next marker in $G$; otherwise, repeat from step (2) for the same
   marker in $G$.


After the three steps, one initial route can be constructed for each marker. The
shortest one can be identified from the $n$ initial routes, and then used for route
improvement. The route used in TSP is always closed. Connecting the start and
end points of the initial route gives a possible solution to TSP. Keeping the
start and end points un-connected gives a possible genetic linkage map.
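The three steps above can be sketched in Python as follows. The function names are hypothetical, and the five-marker distance matrix is a made-up example in which markers sit at known positions on a line, so that the true order is known in advance. Ties in distance are broken by the smaller marker index, which is an arbitrary choice of this sketch.

```python
def route_length(route, dist, closed=False):
    """Sum of adjacent distances; add the closing interval for a TSP route."""
    total = sum(dist[a][b] for a, b in zip(route, route[1:]))
    if closed and len(route) > 1:
        total += dist[route[-1]][route[0]]
    return total


def nearest_neighbor_route(dist, start):
    """Steps 1-3: grow one route from `start`, always extending the endpoint."""
    remaining = set(range(len(dist))) - {start}
    route = [start]
    while remaining:
        nxt = min(sorted(remaining), key=lambda m: dist[route[-1]][m])
        route.append(nxt)
        remaining.remove(nxt)
    return route


def best_initial_route(dist):
    """Build one NN route per starting marker and keep the shortest open route."""
    routes = [nearest_neighbor_route(dist, s) for s in range(len(dist))]
    return min(routes, key=lambda r: route_length(r, dist))


# Hypothetical example: five markers at these positions (cM), listed out of order.
positions = [15, 0, 42, 10, 30]
dist = [[abs(a - b) for b in positions] for a in positions]

for s in range(len(positions)):
    r = nearest_neighbor_route(dist, s)
    print(s, r, route_length(r, dist))
print(best_initial_route(dist))
```

In this example, only the routes started from the two terminal markers recover the true order, which illustrates why starting from every marker and keeping the shortest route is a useful precaution.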


2. Improvement of routes 


The k-Optimal algorithm (abbreviated as k-Opt, k = 2, 3, ...) was first proposed
by Lin (1965). Lin and Kernighan (1973) further investigated its advantageous
properties. k-Opt has been reported to be the most efficient approximate algorithm for
solving large-scale TSP (Laporte, 1992). Detailed information on k-Opt can be
found in previous publications. Only a brief description is given below for
convenience. k-Opt begins with one initial route. 2-Opt breaks the initial closed route at
any two intervals, resulting in two fragments. A new route is formed by exchanging
the start and end points of the two fragments. If the new route is shorter than the
initial one, it will be used as a new initial route for further improvement. 3-Opt
breaks the initial closed route at any three intervals, resulting in three fragments.
A number of new routes can be formed by exchanging the start and end points of the
three fragments, but the one with the shortest route length is selected and then compared
with the previous route.

It has been shown that when k = 2 (i.e., 2-Opt), k-Opt makes the most
significant improvement on initial routes in a short time even for hundreds of cities,
and can also be easily implemented by computer programming. Figure 3.2 is a
graphic representation of the 2-Opt improvement algorithm. Assume the initial
route is broken into two fragments from two marker intervals, i.e., between X and
X + 1, and between Y and Y + 1 (figure 3.2A). A new route is formed by
connecting X with Y + 1, and X + 1 with Y (figure 3.2B). If the new route is shorter
than the previous one, it will be used as the initial route for further improvement;
otherwise, the initial route is broken into two fragments from other marker intervals.


FIG. 3.2 – A graphic representation of the Two-Opt improvement algorithm. (A) Route
before exchange; (B) Route after exchange.

When k = 3, the initial closed route is broken into three fragments from three
marker intervals. There are six ways of connection, by which six new routes are formed
and their lengths are compared. For higher values of k, more fragments are
formed and many more ways of connection have to be considered. Computing
algorithms with k > 3 become much more complicated and also take much longer
time, and are therefore less often applied in solving actual TSP.
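A minimal Python sketch of 2-Opt improvement is given below. It implements the exchange in the usual way, by reversing the stretch of route between the two break points, and accepts a candidate only when the route becomes shorter. The function names and the five-marker example are hypothetical; the `closed` flag switches between a closed route (TSP) and an open route (linkage map), which is an assumption of this sketch rather than a description of any published program.

```python
def route_length(route, dist, closed=False):
    """Sum of adjacent distances; add the closing interval for a TSP route."""
    total = sum(dist[a][b] for a, b in zip(route, route[1:]))
    if closed and len(route) > 1:
        total += dist[route[-1]][route[0]]
    return total


def two_opt(route, dist, closed=False):
    """Repeat segment reversals until no reversal shortens the route."""
    best = list(route)
    best_len = route_length(best, dist, closed)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                # Break the route before position i and after position j,
                # then reconnect with the middle fragment reversed.
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = route_length(cand, dist, closed)
                if cand_len < best_len - 1e-12:
                    best, best_len, improved = cand, cand_len, True
    return best


# Hypothetical example: five markers at these positions (cM), listed out of order.
positions = [15, 0, 42, 10, 30]
dist = [[abs(a - b) for b in positions] for a in positions]

initial = [0, 3, 1, 4, 2]            # a poor initial open route
improved = two_opt(initial, dist)    # open route, as needed for a linkage map
print(route_length(initial, dist), route_length(improved, dist), improved)
```

Recomputing the full route length for every candidate keeps the sketch short; a practical implementation would update only the lengths of the intervals that change.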


3.2.3 Use of the k-Optimal Algorithm in Linkage Map Construction


In a certain sense, the problem of constructing the genetic linkage map can be treated
as one TSP, when markers are treated as cities, and the estimated recombination
frequencies between markers are treated as distances. But dissimilarities do occur.
Firstly, the distance between any two marker loci is estimated by linkage analysis in
a limited-size genetic population, and therefore may be associated with large
sampling error. Secondly, the solution of one TSP is a closed route, but the linkage map is
open-ended. The best solution for TSP may not represent the best order of markers
on the linkage map. Thirdly, while the solution of TSP can be viewed as a
two-dimensional graph, the marker order on the genetic map is linear and
one-dimensional.

Figure 3.3 shows one TSP with 50 cities, where the coordinates of each city are
known, and Euclidean distance is used to calculate the distance between any two
cities. Using each city as a starting point to construct the initial NN route, the
shortest closed route is given in figure 3.3A for 2-Opt and given in figure 3.3B for
3-Opt. The algorithms used are called 2-OptTSP and 3-OptTSP, respectively. The
shortest open route is given in figure 3.3C for 2-Opt and given in figure 3.3D for
3-Opt. The algorithms used are called 2-OptMAP and 3-OptMAP, respectively. By
closed route length, the shortest one is 59 957.26 km among the 50 NN routes,
53 467.78 km among the 50 2-OptTSP routes, and 53 206.44 km among the 50
3-OptTSP routes. The best 2-OptTSP route is 10.82% shorter, and the best
3-OptTSP route is 11.26% shorter than the shortest NN route, indicating the high
efficiency of the k-Optimal algorithm in solving TSP. By open route length, the
shortest one is 52 198.30 km among the 50 NN routes, 48 291.15 km among the 50
2-OptMAP routes, and 47 909.84 km among the 50 3-OptMAP routes. The best
2-OptMAP route is 7.49% shorter, and the best 3-OptMAP route is 8.22% shorter
than the shortest NN route.


[Figure 3.3, panels A and B. A. 2-OptTSP: two-optimal of close route, length of the 
optimum route 53 467.78 km. B. 3-OptTSP: three-optimal of close route, length of the 
optimum route 53 206.44 km.] 

FIG. 3.3 — The best close route of 2-OptTSP and 3-OptTSP (i.e., A and B), and the best 
open route of 2-OptMAP and 3-OptMAP (C and D) in one TSP with 50 cities. 


Thick lines in figure 3.3A and B represent the largest intervals in the best 
2-OptTSP and 3-OptTSP close routes, respectively. Dashed lines in figure 3.3C and 
D connect the start and end cities in the best open routes from 2-OptMAP and 
3-OptMAP, respectively. The largest interval in the best close route is not the 
same as that in the best open route, indicating that the optimal open route may not be 
identified by minimizing the close route length. In other words, the k-Optimal 
algorithm for solving TSP may not be directly used in linkage map construction 
(Zhang et al., 2020). Climer and Zhang (2006) pointed out that the optimization of 
an open route (such as map construction) can be converted into the optimization of a close 
route (such as TSP) by adding a dummy city whose distance to every other city is 
equal to a constant C. 
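
The dummy-city idea can be checked on a small example. The sketch below is an illustration only, using brute-force enumeration on five hypothetical cities; the function names, the coordinates, and the value C = 100 are assumptions made for this example. Every close route through the dummy city consists of one open route over the real cities plus two edges of length C, so the best close route on the extended problem yields the best open route once the dummy city is removed.

```python
from itertools import permutations
import math


def open_length(route, dist):
    """Length of an open route (no edge from the last city back to the first)."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))


def closed_length(route, dist):
    """Length of a close route (includes the edge back to the start)."""
    return open_length(route, dist) + dist[route[-1]][route[0]]


def add_dummy_city(dist, c):
    """Append one dummy city at constant distance c from every real city."""
    n = len(dist)
    extended = [row[:] + [c] for row in dist]
    extended.append([c] * n + [0.0])
    return extended


def best_route(cities, dist, length):
    """Exhaustive search; feasible only for a handful of cities."""
    return min((list(p) for p in permutations(cities)),
               key=lambda r: length(r, dist))


# Hypothetical coordinates of five cities.
coords = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 6)]
dist = [[math.dist(p, q) for q in coords] for p in coords]
n = len(coords)
C = 100.0

best_open = best_route(range(n), dist, open_length)
extended = add_dummy_city(dist, C)
best_closed = best_route(range(n + 1), extended, closed_length)

# Cut the close route at the dummy city (index n) to recover an open route.
k = best_closed.index(n)
recovered = best_closed[k + 1:] + best_closed[:k]
```

In this sketch the recovered open route has the same length as the open route found by direct enumeration, and the close route on the extended problem is longer by exactly 2C.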

Results shown in figure 3.3 may not fully represent the construction of 
linkage maps. More appropriately, one chromosome is assumed to be 
300 cM in length, with 50-300 randomly distributed markers. A total of 1000 
bi-parental DH populations are simulated to investigate the efficiency of the 
k-optimal algorithm in map construction. Ordering accuracies are also compared 
when recombination frequency (abbreviated as REC), LOD score (abbreviated as 
LOD), and map distance (abbreviated as DIS) are used as the distance parameters 
in ordering (see Zhang et al. (2020) for details). 

Rates of correct order from two types of routes, two improvement algorithms, 
and three distance parameters are shown in table 3.3. A correct order in table 3.3 
means that the markers in one simulated population follow exactly the predefined 
order after one initial route and one improvement. When 50 markers are 
randomly distributed on the chromosome, the estimated recombination frequency is 
used as the distance, and the objective is to minimize the length of the close route, 
the correct rate is 0.391 for 2-Opt. In other words, marker orders in 609 out of the 
1000 simulated populations are not completely the same as the true order. Due to 
the fixed length of the chromosome, marker intervals become smaller as the 
number of markers increases. It can be seen from table 3.3 that 
densely distributed markers help to improve the quality of the constructed map. 
Whether the objective is to minimize the close or the open route length, 3-Opt 
never has a lower rate of correct order than 2-Opt. On the other hand, it should be 
mentioned that the running time of the k-Opt algorithm increases quickly with the 
number of markers, especially when k = 3. For 300-500 markers, 2-Opt normally 
takes tens of seconds, but 3-Opt may take hours. Therefore, time efficiency also 
needs to be considered when seeking high quality and high density in the 
constructed linkage maps. 


TAB. 3.3 — Rates of correct order from two types of routes, two improvement algorithms, and 
three distance parameters by one initial route and one improvement. 

Route   Algorithm   Distance used   Number of markers in one chromosome of 300 cM in length 
                                    50       100      150      200      250      300 
Close   2-Opt       Recom. Freq.    0.391    0.939    0.956    0.966    0.955    0.963 
                    LOD score       0.507    0.964    0.96     0.969    0.966    0.959 
                    Map distance    0.076    0.270    0.333    0.334    0.352    0.367 
        3-Opt       Recom. Freq.    0.434    0.975    0.999    1        1        0.999 
                    LOD score       0.546    0.989    1        1        1        1 
                    Map distance    0.100    0.396    0.545    0.582    0.668    0.678 
Open    2-Opt       Recom. Freq.    0.527    0.986    1        1        1        1 
                    LOD score       0.560    0.989    1        1        1        1 
                    Map distance    0.344    0.956    0.991    0.994    0.997    0.997 
        3-Opt       Recom. Freq.    0.541    0.986    1        1        1        1 
                    LOD score       0.561    0.989    1        1        1        1 
                    Map distance    0.386    0.974    1        1        1        1 

As far as the distance parameter is concerned, the correct rate is highest when 
the LOD score is used as the distance in ordering. Map distance gives the lowest 
correct rate. Obviously, different measurements of marker distance lead to 
non-identical orders. The numbers of correctly ordered populations and their 
inclusion relationships are shown in figure 3.4 for the 2-Opt algorithm. In 
descending order, the numbers of correctly ordered populations are 560, 527, and 344 
for LOD score, recombination frequency, and map distance, respectively, matching the 
rates for the open route with 50 markers in table 3.3. There are 299 populations where 
the correct order is achieved by all three distance parameters simultaneously. Though 
the number of correct orders from map distance is the smallest, 24 correct orders 
achieved by map distance cannot be achieved by recombination frequency or LOD score. 
Therefore, there is no guarantee that the correct order has to come from the 
measurement having the highest rate of correct ordering. In practical mapping 
populations, the best situation to be expected 
is that different route types (i.e., close and open), route improvement algorithms 
(i.e., 2-Opt and 3-Opt), and distance parameters (i.e., LOD, REC, and DIS) 
give the same order. If this is not the case, it is better to consult the physical 
map or previously constructed or published maps where part of the markers have 
been mapped, and then choose the most suitable order. 


FIG. 3.4 — Schematic representation of the numbers of correct orders and their inclusion 
relationships when using the MLE of recombination frequency r, LOD score, and map distance 
(cM) as the measurement of distance in the 2-Opt improvement algorithm. 


Due to the discrepancy between TSP and map construction, the k-optimal 
algorithm originally developed for TSP has been modified for map construction, 
using the open route length to determine the better route (Zhang et al., 2020). The 
modified algorithms have been implemented in three software packages. The first 
one is QTL IciMapping, which is mainly designed for genetic analysis in 
bi-parental populations and has ten functionalities (Meng et al., 2015). 
The second one is GACD, which has four functionalities designed for double 
cross F1 and clonal F1 populations (Zhang et al., 2015c). The third one is 
GAPL, which also has four functionalities designed for pure-line populations 
derived from four to eight homozygous parents (Zhang et al., 2019). In fact, QTL 
IciMapping has been occasionally mentioned and applied in previous and current 
chapters. It will be applied again in chapters 4-6. GACD and GAPL will be applied 
in chapters 7 and 8, respectively. REC, LOD, and DIS provide all required 
information for map construction. Once REC, LOD, and DIS between any two 
markers have been estimated, population-specific information (e.g., population 
type, size, and marker type) is not needed anymore for the next step of linkage map 
construction. Therefore, the MAP functionality in QTL IciMapping, the CDM functionality 
in GACD, and the PLM functionality in GAPL share the same interface for parameter 
setting in linkage map construction, and the same interface for user manipulation 
(figure 3.5). Actually, many more options are provided by the interface than 
can be introduced here. Through the interface, linkage maps can be constructed with 
ease for a wide range of genetic populations. 

[Figure 3.5: screenshot of the parameter-setting interface, annotated with the option of 
pair-wise distance; the option of ordering method (Seriation, K-Optimality, By Anchor 
Order, By Input Order); the option of the k-Optimal algorithm (2-OptTSP, 3-OptTSP, 
2-OptMAP, 3-OptMAP); and the option of initials for the k-Optimal improvement (Random NN 
Routes, Previous Order, Shortest NN). The number of random NN routes ranges from one to 
the number of markers; it is applicable only to the first choice of initials and is not 
needed for the other two.] 

FIG. 3.5 — The unified interface of parameter setting in linkage map construction in three 
software packages QTL IciMapping, GACD, and GAPL. 


3.2.4 Rippling of the Ordered Markers 


For one linkage group with tens of markers, the 2-Opt algorithm may acquire the 
order with the shortest length from a number of initial routes and improvements. 
But for larger numbers of markers, the shortest order may not always be achieved. By 
rippling, a shorter order may be further identified. Rippling can be treated as one 
further improvement after the ordering algorithm. Two steps are involved in rippling 
the ordered markers in a linkage group. 


(1) Choose a window of size w, i.e., containing w consecutive 
markers. Window size w is normally between 5 and 10. For small window sizes, 
no improvement may happen; but a longer time is required for large window 
sizes. Assume there are n markers in the group, where n > w. 

(2) Repeat for i = 1, 2, ..., n-w+1. Given i, map lengths are calculated and 
compared for all possible orders of the w markers in the window, i.e., M_i, M_{i+1}, ..., 
M_{i+w-1}. The number of possible orders is equal to w!, e.g., 120 for w = 5, and 
40 320 for w = 8. If a new order shorter than the existing one is 
identified, the previous order is replaced by the new one. 
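
The two steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code of any software package: the names map_length and ripple, the 0-based window index, and the toy distance matrix are introduced here for the example only. The window slides over the n - w + 1 possible start positions, and an order is replaced only when a strictly shorter one is found.

```python
from itertools import permutations


def map_length(order, dist):
    """Sum of distances between neighbouring markers in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def ripple(order, dist, w=5):
    """Slide a window of w markers along the order and keep the shortest arrangement."""
    order = list(order)
    n = len(order)
    w = min(w, n)
    for i in range(n - w + 1):
        best_len = map_length(order, dist)
        best_window = order[i:i + w]
        # Try all w! arrangements of the markers inside the window.
        for perm in permutations(order[i:i + w]):
            candidate = order[:i] + list(perm) + order[i + w:]
            cand_len = map_length(candidate, dist)
            if cand_len < best_len - 1e-12:
                best_len, best_window = cand_len, list(perm)
        order[i:i + w] = best_window
    return order


# Toy example (hypothetical data): six markers at known positions in cM.
positions = [0, 10, 20, 30, 40, 50]
dist = [[abs(a - b) for b in positions] for a in positions]
start = [0, 2, 1, 3, 5, 4]          # two local inversions
rippled = ripple(start, dist, w=3)  # -> [0, 1, 2, 3, 4, 5]
```

Because every window requires w! evaluations, the cost grows quickly with w, which is why small windows are normally preferred.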


In fact, for map construction, the seriation ordering algorithm (Buetow and 
Chakravarti, 1987) takes the NN routes as initial orders and uses rippling to 
repeatedly improve the initial orders by various sizes of marker windows. Using the 
LOD score given in table 3.2 as the distance parameter, the order of 14 markers in 
the barley DH population is acquired and given in table 3.4. The average marker 
interval is 12.34 cM, and the largest interval is 60.59 cM in length determined by 
markers aHor2 and MWG943. The coefficient of interference between two neigh- 
boring intervals is also given in table 3.4. Both positive and negative interference can 
be observed on chromosome 1H in barley. 


TAB. 3.4 — Order of 14 markers on chromosome 1H in barley constructed from the barley DH 
population. 

Order on      Marker      Distance (cM) from   Position   Coefficient of 
linkage map   name        previous marker      in cM      interference 
1             Act8A                            0 
2             OP06        10.88                10.88 
3             aHor2       7.64                 18.52      -3.422 
4             MWG943      60.59                79.11      -0.037 
5             ABG464      13.07                92.18      0.174 
6             Dor3        19.28                111.46     0.930 
7             iPgd2       3.55                 115.01     0.617 
8             cMWG733A    7.04                 122.05     0.053 
9             AtpbA       3.56                 125.61     0.899 
10            drun8       13.61                139.22     1.678 
11            ABC261      4.88                 144.10     -4.326 
12            ABG710B     7.09                 151.19     -1.061 
13            Aga7        3.50                 154.69     2.157 
14            MWG912      5.70                 160.39     1.454 


3.2.5 Integration of Multiple Maps 


Many genetic populations are suitable for constructing linkage maps, such as 
bi-parental populations which have been introduced in chapter 2, and multi-parental 
populations which will be introduced in chapters 7 and 8. In human and animal 
studies, nuclear families are commonly used in linkage analysis and map 
construction. Interested readers can refer to the literature or books on human genetics. 

Sometimes, one set of markers is used in genotyping a number of populations. 
Sometimes, different sets of markers are used in different populations. Even when 
the same set of markers is used in two populations, some markers are polymorphic in 
one population but not in the other. Non-polymorphic markers cannot be included 
in linkage analysis and therefore cannot be located on the constructed linkage map 
either. More often, a number of populations only have some markers in common. 

For individual chromosomes in one species, multiple linkage maps can be con- 
structed from different populations by different researchers. Different markers (some 
of them can be common) are distributed on these maps which may have different 
map lengths and orders. By using common markers on these maps, it is possible to 
construct one integrated genetic map, which is also called a consensus map some- 
times. For example, Qu et al. (2020) constructed the first integrated linkage map in 
rice from three multi-parental populations and conducted QTL mapping for heading 
date and plant height; Qu et al. (2021) provided a consensus map from three pop- 
ulations of RILs in wheat using the 90K single nucleotide polymorphism 
(SNP) array. A consensus map combines genetic information from multiple popu- 
lations, providing an effective alternative to improve the genome coverage and 
marker density. The consensus map can be used for some particular purposes, such 
as the imputation of missing genotypic data in some populations, and simulation 
and prediction studies in breeding (Yao et al., 2018). 

Table 3.5 gives a consensus map for one chromosome in Arabidopsis, which is 
integrated from three linkage maps. The three individual maps include 13, 18, and 
20 markers, with map lengths of 89 cM, 99 cM, and 94 cM, respectively. The 
integrated map has 25 unique markers from the three individual maps. Seven of 
the 25 unique markers occur on just one map, 10 of them occur on two maps, and 
eight of them occur on all three maps. On individual maps, the number 0 indicates that the 
unique marker is not included; a number greater than 0 gives the order of the 
unique marker. From common markers, pair-wise map distances are re-calculated for 
the 25 unique markers. Take two markers, i.e., order 5 (SNP107) and order 8 
(SNP100) on the integrated map, as an example. On the first map, they are located 
at order 2 and order 5, 17 cM apart (i.e., 25.0-8.0). On the second map, they 
are located at order 3 and order 5, 13 cM apart (i.e., 24.0-11.0). On the third 
map, they are located at order 3 and order 6, 18 cM apart (i.e., 28.0-10.0). The 
average of the three distances, i.e., (17 + 13 + 18)/3 = 16 cM, is used as the 
distance between SNP107 and SNP100 on the consensus map. In a similar way, all 
pair-wise distances can be acquired, given in a lower triangular format similar to 
table 3.2, and then used to construct the integrated map. The integrated map is 
132.2 cM in length, longer than any of the individual maps. A linkage map with more 
markers tends to be longer than a map with fewer markers. This is because some double 
crossing-overs remain unidentified when marker intervals are large, but may become 
identifiable when more markers are added in the large intervals. 

TAB. 3.5 — Integration of three linkage maps sharing common markers. 

In constructing the integrated map, the distance between some markers may not 
be estimable. For example, for marker SNP388 located at order 3 on the 
third map, and marker SNP301 located at order 17 on the first map, the three 
individual maps do not have the required information to estimate their distance. In 
this situation, similar to the estimates of recombination frequency greater than 0.5 
(table 3.2), a distance of 1000 cM is assigned to the two markers. In fact, not every 
distance has to be used in map construction. From the integrated map, the 
distance between the two markers can still be given, i.e., 78.2 cM (i.e., 88.2-10.0). 
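
The averaging rule and the 1000 cM placeholder can be sketched in a few lines. This is an illustrative Python sketch, not the procedure of any particular software; the function name consensus_distances and the data layout (one dictionary of marker positions per map) are assumptions. The positions of SNP107 and SNP100 are those quoted in the text, whereas the positions given to SNP301 and SNP388 are hypothetical placeholders used only to create a pair that never occurs on the same map.

```python
from itertools import combinations

UNLINKED_CM = 1000.0  # distance assigned when no map carries both markers


def consensus_distances(maps):
    """Average the pair-wise distance over all maps that carry both markers."""
    markers = sorted({m for positions in maps for m in positions})
    result = {}
    for a, b in combinations(markers, 2):
        observed = [abs(positions[a] - positions[b])
                    for positions in maps
                    if a in positions and b in positions]
        result[(a, b)] = sum(observed) / len(observed) if observed else UNLINKED_CM
    return result


maps = [
    {"SNP107": 8.0, "SNP100": 25.0, "SNP301": 70.0},  # first map; SNP301 position hypothetical
    {"SNP107": 11.0, "SNP100": 24.0},                 # second map
    {"SNP107": 10.0, "SNP100": 28.0, "SNP388": 5.0},  # third map; SNP388 position hypothetical
]
pairwise = consensus_distances(maps)
# pairwise[("SNP100", "SNP107")] -> 16.0, the average of 17, 13 and 18 cM
# pairwise[("SNP301", "SNP388")] -> 1000.0, no map carries both markers
```

The resulting pair-wise distances play the same role as the lower triangular matrix of table 3.2 in the ordering step.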


3.3 Comparison of the Recombination Frequency 
Estimation in Different Populations 


Accurate estimation of recombination frequency is essential to linkage analysis and 
map construction. Many factors can affect the accuracy of recombination frequency 
estimation, such as population size, number of identifiable genotypes, missing 
values, genotyping errors, and segregation distortion. In theory, Fisher's information 
criterion (see §2.3) can be used to compare the estimation by considering these 
factors. But the derivation of Fisher's information can quickly become intractable 
when more factors are added. In this section, the estimation efficiency of 
recombination frequency in bi-parental populations is briefly described from a 
simulation study. The criteria used for comparison are the LOD score in testing the 
linkage relationship, the deviation between the estimated and true recombination 
frequencies, the standard error of the estimate, and the least theoretical population 
size required to observe at least one recombinant and to declare a statistically 
significant linkage relationship. More details can be found in Sun et al. (2012). 

As shown in figure 1.1, one backcross population using P1 as the recurrent parent 
has the same genetic structure as the other one using P2 as the recurrent parent for 
co-dominant loci. Therefore, only backcrossing with parent P1 is considered 
below. A total of twelve bi-parental populations are simulated to compare the 
precision in recombination frequency estimation, i.e., F2, F3, RIL, DH, P1BC1F1, 
P1BC1F2, P1BC1RIL, P1BC1DH, P1BC2F1, P1BC2F2, P1BC2RIL, and P1BC2DH as 
given in figure 1.1, chapter 1. For convenience, parent P1 is dropped from the 
population names in this section, which does not cause any confusion. Based on gene 
frequency, the 12 populations can be classified into three categories. In F1-derived 
populations (incl. F2, F3, RIL, and DH), the frequency of the P1 allele is 0.5 at each 
locus. In BC1F1-derived populations (incl. BC1F1, BC1F2, BC1RIL, and BC1DH), 
the frequency of the P1 allele is 0.75 at each locus. In BC2F1-derived populations 
(incl. BC2F1, BC2F2, BC2RIL, and BC2DH), the frequency of the P1 allele is 0.875 at 
each locus. 
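
The three gene frequencies follow from the fact that each backcross to P1 halves the frequency of the other parental allele. As a short worked equation, with t denoting the number of backcross generations (t = 0 for the F1) and p_t the frequency of the P1 allele, both symbols being introduced here for illustration only:

```latex
p_t = 1 - \left(\tfrac{1}{2}\right)^{t+1},
\qquad p_0 = 0.5, \quad p_1 = 0.75, \quad p_2 = 0.875 .
```

In the absence of selection, neither selfing nor doubled haploid production changes the expected allele frequency, so all populations derived from the same BC_tF1 share the same value of p_t.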


3.3.1 LOD Score in Testing the Linkage Relationship 
in Different Populations 


The LOD score is the test statistic for detecting the significance of the linkage 
relationship between two marker loci. When the LOD score is greater than a threshold 
value (e.g., 3 in most cases), the two loci are declared to be significantly linked; 
otherwise, the linkage relationship is declared to be non-significant. The higher the 
LOD score, the more significant the linkage relationship between the two loci under 
consideration. Averaged LOD scores and the respective standard errors (SE) in 12 
bi-parental populations under three levels of population size (PS) are shown in 
figure 3.6A for true recombination frequency r = 0.05 and in figure 3.6B for 
true recombination frequency r = 0.2. The average values are calculated from 1000 
simulations. In temporary populations, only co-dominant markers are considered. It 
is clear that larger PS and smaller r result in higher LOD scores, regardless of the type 
of population, indicating that large PS and small r are favorable to the detection 
of linkage relationships. 
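
For the simplest case, a DH (or BC1F1) population with two co-dominant markers, the dependence of the LOD score on PS and r can be illustrated directly. The sketch below assumes the usual likelihood-ratio definition of the LOD score against the null hypothesis r = 0.5, with recombinant and non-recombinant genotypes having expected frequencies r and 1 - r; the function name lod_dh and the example counts are hypothetical.

```python
import math


def lod_dh(n_total, n_recombinant):
    """LOD score for two co-dominant markers in a DH or BC1F1 population.

    The MLE of r is the proportion of recombinants, restricted to [0, 0.5];
    the LOD compares the likelihood at the MLE with that at r = 0.5.
    """
    n, k = n_total, n_recombinant
    r_hat = k / n
    if r_hat >= 0.5:
        return 0.0  # no evidence of linkage
    lod = n * math.log10(2.0) + (n - k) * math.log10(1.0 - r_hat)
    if k > 0:
        lod += k * math.log10(r_hat)
    return lod


# Hypothetical counts: the same recombinant proportion at two population sizes,
# and a larger recombinant proportion at the smaller size.
lod_small_r_100 = lod_dh(100, 5)    # about 21.5
lod_small_r_200 = lod_dh(200, 10)   # twice the value above
lod_large_r_100 = lod_dh(100, 20)   # about 8.4
```

Doubling the population size at the same recombinant proportion doubles the LOD score, and a larger recombinant proportion at the same size lowers it, in line with the trend described above.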

LOD scores of the 12 bi-parental populations are significantly different 
(figure 3.6A and B), but a similar trend can be seen for different r and PS values. 
LOD scores from the four F1-derived populations (i.e., F2, F3, F1DH, and F1RIL) are 
higher than those from BC1F1 and the three BC1F1-derived populations (i.e., BC1F2, 
BC1DH, and BC1RIL), respectively. BC1F1 and the three BC1F1-derived populations 
have higher LOD scores than BC2F1 and the three BC2F1-derived populations (i.e., 
BC2F2, BC2DH, and BC2RIL), respectively. Each generation of backcrossing reduces 
the frequency of the non-recurrent parental allele by half. Gene frequency deviating from 
0.5 is detrimental to the detection of linkage relationships in bi-parental populations. 
Therefore, populations from two or more generations of backcrossing are not frequently 
seen in genetic studies on linkage analysis and map construction. BC1F1 and BC2F1 
have LOD scores similar to F1DH and BC1DH, respectively, due to the same number 
of genotypes and the same genotypic frequencies (see tables 2.4 and 2.5 in chapter 2). 

FIG. 3.6 — Average LOD score in testing the linkage relationship for two recombination 
frequencies 0.05 (A) and 0.2 (B), acquired from 1000 times of simulation for each population. 
In temporary populations, two linked loci are both assumed to be co-dominant. 

Populations having the same gene frequencies can still have different LOD scores 
and therefore give unequal powers in linkage analysis. For r = 0.05 and gene 
frequency 0.5, the four populations are F2, F3, DH, and RIL in descending order of 
the LOD score (figure 3.6A). For two co-dominant markers, F2 and F3 include nine 
identifiable genotypes, providing the most information on crossing-over and 
recombination, and therefore have the highest LOD scores. In comparison, there are 
only four homozygous genotypes in DH and RIL. The difference between F2 
and F3 comes from the different genotypic frequencies, as is also the case for the 
difference between DH and RIL. In comparison with F2, one more generation of 
selfing happens in F3, giving more chance for crossing-over to happen. It can be 
shown that the accumulated recombination frequency after two generations of 
selfing in F3 is always greater than the one-generation recombination 
frequency estimated in F2. The accumulation in recombination frequency weakens the 
linkage relationship, or linkage dis-equilibrium in other words, between two markers, 
and reduces the LOD score. Similarly, the F1-derived DH population 
undergoes only one generation of meiosis and then becomes homozygous immediately, 
whereas the RIL population approaches homozygosity gradually during repeated selfing. 
In RIL, the accumulated recombination frequency is about two times the one-meiosis 
recombination frequency r when r is small (see equation 2.7 and exercise 2.8). 
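
The factor of about two is consistent with the well-known Haldane-Waddington relation for RILs developed by repeated selfing, which is assumed here to be the relation behind equation 2.7. The symbol R for the accumulated recombination frequency is introduced for this illustration only:

```latex
R = \frac{2r}{1 + 2r} \approx 2r \quad \text{for small } r,
\qquad \text{e.g. } r = 0.05 \;\Rightarrow\; R = \frac{0.1}{1.1} \approx 0.091 .
```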

For r = 0.05 and gene frequency 0.75, the four populations are BC1F2, BC1F1, 
BC1DH, and BC1RIL in descending order of the LOD score (figure 3.6A). For 
populations BC1F1 and BC1F2, the higher LOD score in BC1F2 can also be 
explained by the higher number of genotypes. BC1F1 includes only four genotypes; 
after one generation of selfing, BC1F2 has nine genotypes, similar to F2. The 
increased number of genotypes provides additional information on crossing-over and 
recombination. Similarly, the lower LOD scores in BC1DH and BC1RIL can also be 
explained by the smaller number of genotypes in both populations. For r = 0.05 and 
gene frequency 0.875, the four populations are BC2F2, BC2F1, BC2DH, and BC2RIL 
in descending order of the LOD score (figure 3.6A), similar to gene frequency 
0.75. For true recombination frequency 0.2 (figure 3.6B), the order of populations by 
LOD score is not completely the same as that for the true value 0.05 shown in figure 3.6A. 

Theoretical and simulation results indicate that the LOD score can be affected 
by the type and size of the mapping population, gene frequency, number of geno- 
types, genotypic frequencies, and number of meiosis generations in population 
development, etc. Generally speaking, an ideal mapping population should have 
equal allelic frequencies at each locus, a large number of identifiable genotypes, fewer 
meiosis generations, and finally more individuals or families. 


3.3.2 Accuracy of the Estimated Recombination Frequency 


Deviations between the estimated recombination frequencies and the true value 0.3, 
together with their standard errors, are shown in figure 3.7 for three levels of PS. In 
temporary populations, two linked loci are both assumed to be co-dominant. Less 
deviation and smaller SE indicate higher accuracy in recombination frequency 
estimation. As expected, the deviation and SE of the estimated recombination frequency 
decline with the increase in PS. When PS = 200, deviation and SE are almost equal 
to zero for the 12 populations, indicating that large populations 
lead to high accuracy in estimation. Deviations in BC2F1-related populations (i.e., 
BC2F1, BC2DH, BC2RIL, and BC2F2) are generally higher than those in the other 8 
populations, indicating that more generations of backcrossing are not favored for the 
precise estimation of recombination frequency. Populations with equal size and 
equal gene frequency can also have unequal accuracies due to the difference in the 
number of genotypes and genotypic frequencies. 

In figures 3.6 and 3.7, both markers are assumed to be co-dominant. Obviously, 
the results shown previously cannot simply be extended to other marker categories as 
given in tables 2.6-2.10 in chapter 2. When both markers are dominant, the F2 
population has only four identifiable genotypes, and the accuracy in recombination 
frequency estimation may not be higher than the accuracy in DH or RIL populations 
of the same size. In some backcross populations, recombination frequency cannot 
even be estimated, let alone estimated accurately. The effect of 
dominant and recessive markers on linkage analysis will be discussed in detail in the 
next section. 

FIG. 3.7 — Average deviation (A) and standard error (B) of the estimated recombination 
frequency for true value 0.3, acquired from 1000 times of simulation for each population. In 
temporary populations, two linked loci are both assumed to be co-dominant. 


3.3.3 Least Population Size to Declare the Significant 
Linkage Relationship and Close Linkage 


To detect the linkage relationship, the mapping population needs to be large 
enough. On the one hand, at least one recombinant individual needs to be observed so 
that the estimated recombination frequency can be greater than 0 for tight linkage. 
On the other hand, the LOD score needs to be greater than a threshold (3, for 
example) so that the estimated recombination frequency can be significantly smaller 
than 0.5 for loose linkage. For convenience, the least population size for at least one 
recombinant to be present at the 95% probability level is given in table 3.6; the least 
population size for the LOD score to be greater than the threshold of 3 is given in table 3.7. 
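
For a DH population, both criteria can be sketched in a few lines. The code below is an illustration under stated assumptions, not the procedure used to build the tables: the first size is taken as the smallest n for which 1 - (1 - p)^n reaches 0.95, where p is the frequency of recombinant genotypes (p = r in DH), and the second as the smallest n for which the expected LOD score reaches 3. The function names are hypothetical, and the relation R = 2r/(1 + 2r) used for RIL is the standard Haldane-Waddington formula, assumed here. Under these assumptions the sketch reproduces the DH entries for r = 0.01 in tables 3.6 and 3.7.

```python
import math


def min_size_recombinant(p_recombinant, prob=0.95):
    """Smallest n such that at least one recombinant is seen with probability prob."""
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - p_recombinant))


def expected_lod_per_individual_dh(r):
    """Expected LOD contribution of one DH individual at true recombination frequency r."""
    return (1.0 - r) * math.log10(2.0 * (1.0 - r)) + r * math.log10(2.0 * r)


def min_size_lod_dh(r, threshold=3.0):
    """Smallest n for which the expected LOD score in DH reaches the threshold."""
    return math.ceil(threshold / expected_lod_per_individual_dh(r))


def ril_recombination(r):
    """Accumulated recombination frequency in selfing RILs (assumed standard formula)."""
    return 2.0 * r / (1.0 + 2.0 * r)


r = 0.01
n_recombinant_dh = min_size_recombinant(r)                      # 299
n_recombinant_ril = min_size_recombinant(ril_recombination(r))  # 152
n_lod_dh = min_size_lod_dh(r)                                   # 11
least_size_dh = max(n_recombinant_dh, n_lod_dh)                 # 299
```

Taking the maximum of the two sizes reflects the fact that, for tight linkage, observing a recombinant is the limiting requirement, whereas for loose linkage the LOD threshold is.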

§3.3.1 shows that the LOD score is high for tight linkage, but low for loose linkage. 
When two markers are closely linked but not co-located at the same locus, having a 
LOD score greater than the threshold may not be an issue. The issue is that at 
least one crossing-over event has to be observed in the population (table 3.6); 
otherwise, the recombination frequency will be estimated as 0. When two markers are 
not closely linked, observing crossing-over events in the population may not be an 
issue. The issue is to have a LOD score greater than the threshold (table 3.7); 
otherwise, the two markers will be declared as non-significantly linked. Therefore, in 
practice, the larger of the two PS values given in tables 3.6 and 3.7 should 
be used as the least population size. For example, for BC1F1 and two co-dominant 
markers, 299 individuals are needed to be 95% sure of observing at least one 
recombinant for r = 0.01; only 11 are needed to make the linkage relationship statistically 
significant. Therefore, the BC1F1 population should have a size of at least 299 to 
detect two markers linked at r = 0.01. 


TAB. 3.6 — The least population size to assure a probability 95% that at least one 
recombinant is present in the population. 

Population            True recombination frequency 
                      r=0.01    r=0.02    r=0.03   r=0.05   r=0.1   r=0.2   r=0.3 
F2 (CC)               150       75        50       30       15 
F2 (CD, CR)           299       149       99       60       31      16      11 
F2 (DD, RR)           299       149       99       60       31      16      11 
F2 (DR)               149 786   209 956   13 616   4754     1197    299     132 
F3 (CC)               121       61        41       25       13      7       5 
F3 (CD, CR)           199       99        67       41       21      11      8 
F3 (DD, RR)           213       99        67       41       21      11 
F3 (DR)               998       598       373      229      110     52      34 
DH                    299       149       99       59       29      14      9 
RIL                   152       77        52       32       17      9       7 
BC1F1 (CC, CR, RR)    299       149       99       59       29      14      9 
BC1F2 (CC)            172       86        58       35       18      9       7 
BC1F2 (CD)            427       199       135      82       43      24      17 
BC1F2 (CR)            249       119       80       48       24      12      8 
BC1F2 (DD)            373       213       135      82       43      24      17 
BC1F2 (DR)            2995      998       748      427      213     102     66 
BC1F2 (RR)            249       124       82       49       24      12      8 
BC1DH                 300       150       100      60       31      16      11 
BC1RIL                203       103       70       43       23      13      10 
BC2F1 (CC, CR, RR)    300       150       100      60       31      16      11 
BC2F2 (CC)            242       122       82       50       27      15      11 
BC2F2 (CD)            748       332       213      129      70      39      31 
BC2F2 (CR)            299       157       99       61       32      17      12 
BC2F2 (DD)            748       299       213      124      70      39      31 
BC2F2 (DR)            2995      1497      748      498      249     124     85 
BC2F2 (RR)            299       149       99       61       32      17      12 
BC2DH                 403       203       136      83       43      24      17 
BC2RIL                305       156       106      66       36      21      16 

Notes: Given behind the names of temporary populations are the dominant and recessive 
relationships, i.e., CC for two co-dominant markers, CD for one co-dominant marker and one 
dominant marker, CR for one co-dominant marker and one recessive marker, DD for two 
dominant markers, DR for one dominant marker and one recessive marker, and RR for two 
recessive markers. For BC1F1 and BC2F1, recombination frequency cannot be estimated for 
CD, DD, and DR: the least population size cannot be given. 


TAB. 3.7 — The least population size to assure that LOD score is greater than 3 in testing the 
linkage relationship in the population. 

Population            True recombination frequency 
                      r=0.01    r=0.02    r=0.03   r=0.05   r=0.1   r=0.2   r=0.3 
F2 (CC)               8         9         9        11       15      31      78 
F2 (CD, CR)           14        15        16       19       26      51      123 
F2 (DD, RR)           14        15        16       19       27      56      147 
F2 (DR)               82        83        83       86       96      138     262 
F3 (CC)               8         9         9        11       17      41      121 
F3 (CD, CR)           12        13        15       17       26      58      162 
F3 (DD, RR)           12        14        15       17       26      62      179 
F3 (DR)               31        33        35       39       52      100     246 
DH                    11        12        13       14       19      36      84 
RIL                   12        14        15       18       29      73      219 
BC1F1 (CC, CR, RR)    11        12        13       14       19      36      84 
BC1F2 (CC)            9         10        11       12       18      40      107 
BC1F2 (CD)            21        23        25       29       42      90      236 
BC1F2 (CR)            12        13        14       16       23      49      125 
BC1F2 (DD)            21        23        25       29       44      101     289 
BC1F2 (DR)            54        57        59       65       84      150     343 
BC1F2 (RR)            12        13        14       16       23      49      128 
BC1DH                 14        15        16       19       27      56      147 
BC1RIL                15        16        18       22       34      83      238 
BC2F1 (CC, CR, RR)    14        15        16       19       27      56      147 
BC2F2 (CC)            13        15        16       19       29      68      199 
BC2F2 (CD)            34        37        41       48       72      166     469 
BC2F2 (CR)            17        18        20       23       34      78      218 
BC2F2 (DD)            34        38        41       49       76      193     606 
BC2F2 (DR)            66        70        75       84       114     229     585 
BC2F2 (RR)            17        18        20       23       34      79      220 
BC2DH                 21        23        25       29       44      101     289 
BC2RIL                22        24        27       33       52      133     406 

Notes: Given behind the names of temporary populations are the dominant and recessive 
relationships, i.e., CC for two co-dominant markers, CD for one co-dominant marker and one 
dominant marker, CR for one co-dominant marker and one recessive marker, DD for two 
dominant markers, DR for one dominant marker and one recessive marker, and RR for two 
recessive markers. For BC1F1 and BC2F1, recombination frequency cannot be estimated for 
CD, DD, and DR: the least population size cannot be given. 

For a specific level of r, the least PS for BC2F1-derived populations is always 
larger than that for other populations. For example, for r = 0.03 and co-dominant markers, 
the least PS required to observe at least one recombinant is 136 for BC2DH 
(table 3.6), which is the largest among the 12 populations. The least PS required to 
detect the significant linkage relationship is 27 for BC2RIL (table 3.7), which is 
the largest among the 12 populations. The results indicate once again that more 
generations of backcrossing are not favored in linkage analysis. 

The dominant and recessive relationship between two alleles at each locus has no 
obvious effect in permanent populations but can make a big difference in temporary 
populations. Tables 3.6 and 3.7 also give the least population size when one marker or 
both markers are not co-dominant. For example, for r = 0.01 in the F2 population, at 
least one recombinant can be observed in 150 individuals with a probability of 95% when 
both markers are co-dominant, but 300 individuals are needed to observe at least one 
recombinant when one marker becomes dominant. When one marker is dominant and the 
other is recessive, a huge population is needed to observe at least one recombinant 
(table 3.6). The close linkage between one dominant marker and one 
recessive marker therefore represents the worst situation in linkage analysis in temporary 
populations. Co-dominant markers should be used in genotyping when 
temporary populations are used in genetic studies, and linkage relationships between 
dominant markers and recessive markers should be avoided as much as possible. 
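
The population sizes quoted above for observing at least one recombinant can be checked with a short calculation. The following is a minimal sketch, not the method used to build table 3.6: it assumes that each individual independently shows an identifiable recombinant with probability q, so that the least size n is the smallest integer satisfying 1 - (1 - q)^n >= 0.95. The function name `least_size` is hypothetical. For a backcross-like population q = r is used; for F2 with two co-dominant markers, q = 2r - 1.5r^2 is an assumption made here (every genotype except the two parental homozygotes and the double heterozygote is counted as an identifiable recombinant).

```python
import math

def least_size(q, prob=0.95):
    """Smallest n such that 1 - (1 - q)**n >= prob (hypothetical helper)."""
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - q))

r = 0.01
n_bc = least_size(r)                      # backcross-like case, q = r
n_f2 = least_size(2 * r - 1.5 * r ** 2)   # assumed q for F2, co-dominant markers
print(n_bc, n_f2)
```

Under these assumptions the two values agree with the sizes of 299 and 150 quoted in the text for r = 0.01.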


3.4 Linkage Analysis in Random Mating Populations 


3.4.1 Linkage Dis-Equilibrium in Random Mating 
Populations 


Properties of randomly mating populations and the concept of linkage 
dis-equilibrium are briefly introduced here. More details can be found in 
textbooks on population or quantitative genetics, e.g., Falconer and Mackay (1996) 
and Wang (2017). At any locus, the population will reach Hardy-Weinberg 
equilibrium after one generation of random mating if other factors, such as limited 
population size, mutation, migration, and selection, are not considered. For populations in 
Hardy-Weinberg equilibrium, genotypic frequencies at each locus can be derived 
from the allele frequencies at the locus. In such populations, the number of 
genotypes is much larger than the number of alleles, especially when multiple alleles 
are present and when multiple loci have to be considered. The linkage relationship is 
complicated if based on genotypes, but can be more easily investigated at the 
gamete or haploid level. Under random mating, the mating between diploid 
individuals is equivalent to the random union of the female and male gametes 
generated by diploid individuals in the population. Frequencies of the diploid genotypes 
at multiple loci can also be deduced from a random combination of the haploid 
gametes. Therefore, the linkage relationship at the gametic phase is actually 
equivalent to the relationship in the diploid progenies. 

When two loci, each with two alleles, are not genetically linked or, in other words, 
not associated, frequencies of the four gametes, i.e., AB, Ab, aB, and ab, can be 
calculated from the allele frequencies at each locus, corresponding to the expanded 
terms of the polynomial $(p_A + p_a)(p_B + p_b)$. Such two loci are said to be at 
equilibrium, or not associated, or independent as far as the random mating 
system is concerned. When the four gametes have frequencies not equal to their 
equilibrium frequencies, the two loci are said to be at gametic phase 
dis-equilibrium. The degree of dis-equilibrium, represented by D, can be measured 
for each gamete by the deviation from its equilibrium frequency, which is given in 
equation 3.14, where $p_{AB}$, $p_{Ab}$, $p_{aB}$, and $p_{ab}$ are the actual frequencies of gametes in 
the population, summing up to one. 


$$
\begin{aligned}
D_{AB} &= p_{AB} - p_A p_B, & D_{Ab} &= p_{Ab} - p_A p_b, \\
D_{aB} &= p_{aB} - p_a p_B, & D_{ab} &= p_{ab} - p_a p_b
\end{aligned} \tag{3.14}
$$


Obviously, the frequencies of four alleles at the two loci can be calculated from 
the frequencies of the four gametes, which are given in equation 3.15. 


$$
\begin{aligned}
p_A &= p_{AB} + p_{Ab}, & p_a &= p_{aB} + p_{ab}, \\
p_B &= p_{AB} + p_{aB}, & p_b &= p_{Ab} + p_{ab}
\end{aligned} \tag{3.15}
$$


Replacing the four allele frequencies in equation 3.14 with those given in 
equation 3.15, and considering that the four gametic frequencies sum to one, a 
relationship between D and the gametic frequencies can be derived, i.e., equation 3.16. 


$$
\begin{aligned}
D_{AB} &= D_{ab} = p_{AB} p_{ab} - p_{Ab} p_{aB}, \\
D_{Ab} &= D_{aB} = -(p_{AB} p_{ab} - p_{Ab} p_{aB})
\end{aligned} \tag{3.16}
$$


The four degrees given in equation 3.16 are identical in their absolute values, 
which is called the dis-equilibrium at the gametic phase between the two loci. For 
convenience, the linkage phase in gametes AB and ab is called coupling, and the 
phase in gametes Ab and aB is called repulsive. Therefore, dis-equilibrium is actually 
the difference between $p_{AB} p_{ab}$ (i.e., half of the frequency of diploid genotype AB/ab 
coming from the random combination of two coupling-phase gametes) and $p_{Ab} p_{aB}$ 
(i.e., half of the frequency of diploid genotype Ab/aB coming from the random 
combination of two repulsive-phase gametes). In other words, the gametic phase 
dis-equilibrium can also be understood as half of the difference in frequencies of the 
two phases of the double heterozygote. When the two phases have equal frequency, 
the two loci under consideration are at equilibrium; otherwise, dis-equilibrium 
occurs. 

For convenience, D as given in equation 3.17 is required to be equal to or 
greater than 0. If a negative value is obtained from equation 3.17, a positive value 
can be easily obtained by switching the coupling and repulsive relationship. Therefore, 
a positive D value from equation 3.17 is assumed in the following discussions. 


$$
D = p_{AB} p_{ab} - p_{Ab} p_{aB} \tag{3.17}
$$
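
The agreement between the per-gamete deviations of equation 3.14 and the single quantity D of equation 3.17 can be illustrated numerically. The sketch below is only an illustration; the function name `gametic_disequilibrium` and the example frequencies are hypothetical and do not come from the text.

```python
def gametic_disequilibrium(p_AB, p_Ab, p_aB, p_ab):
    """Return D (equation 3.17) and the four deviations of equation 3.14."""
    # Allele frequencies from gametic frequencies (equation 3.15)
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    D = p_AB * p_ab - p_Ab * p_aB
    deviations = {
        "AB": p_AB - p_A * p_B,
        "Ab": p_Ab - p_A * p_b,
        "aB": p_aB - p_a * p_B,
        "ab": p_ab - p_a * p_b,
    }
    return D, deviations

# Illustrative gametic frequencies summing to one
D, dev = gametic_disequilibrium(0.4, 0.1, 0.2, 0.3)
print(D, dev)
```

For these frequencies the coupling-phase deviations equal D and the repulsive-phase deviations equal -D, as stated by equation 3.16.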
As has been seen in the unselected bi-parental populations in chapter 2, 
frequencies of both haploid types and diploid genotypes at two linked loci cannot be 
obtained simply from the frequencies at each locus. They depend on the 
recombination frequency between the two linked loci. Two loci are at dis-equilibrium unless 
they are located on two chromosomes. In random mating populations, D as defined in 
equation 3.17 is also called linkage dis-equilibrium (LD) in genetics. It should be 
noted, however, that in addition to genetic linkage there are other factors, such as 
population admixture and selection, which can also cause dis-equilibrium. 

On the other hand, when dis-equilibrium is known, the four gametic frequencies 
can be expressed by the equilibrium frequencies and the degree of dis-equilibrium, as 
given in equation 3.18. 


$$
\begin{aligned}
p_{AB} &= p_A p_B + D, & p_{Ab} &= p_A p_b - D, \\
p_{aB} &= p_a p_B - D, & p_{ab} &= p_a p_b + D
\end{aligned} \tag{3.18}
$$


Now we can see how dis-equilibrium changes under random mating. Let 
$D_1$ represent the dis-equilibrium between two loci at the first generation of random 
mating, and $D_t$ represent the dis-equilibrium after t (t ≥ 1) generations of random 
mating. It can be proved that $D_t$ is given by equation 3.19, where r is the recombination 
frequency. 


$$
D_t = D_1 (1 - r)^{t-1} \tag{3.19}
$$


From equations 3.18 and 3.19, the four gametic frequencies after t generations of 
random mating can be calculated from equation 3.20. 


$$
\begin{aligned}
p_{AB} &= p_A p_B + D_t, & p_{Ab} &= p_A p_b - D_t, \\
p_{aB} &= p_a p_B - D_t, & p_{ab} &= p_a p_b + D_t
\end{aligned} \tag{3.20}
$$
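
The decay of dis-equilibrium described by equations 3.19 and 3.20 can be tabulated with a few lines of code. This is a minimal sketch; the function names `ld_after` and `gametes_after` and the example values of D_1 and r are hypothetical.

```python
def ld_after(D1, r, t):
    """Equation 3.19: D_t = D_1 * (1 - r)**(t - 1), for t >= 1."""
    return D1 * (1.0 - r) ** (t - 1)

def gametes_after(p_A, p_B, D1, r, t):
    """Equation 3.20: frequencies of gametes (AB, Ab, aB, ab) at generation t."""
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    Dt = ld_after(D1, r, t)
    return (p_A * p_B + Dt, p_A * p_b - Dt, p_a * p_B - Dt, p_a * p_b + Dt)

# Tightly linked loci keep their dis-equilibrium much longer than unlinked loci
for r in (0.01, 0.1, 0.5):
    print(r, [round(ld_after(0.25, r, t), 4) for t in (1, 2, 5, 10)])
```

With r = 0.5 the dis-equilibrium is halved in every generation, whereas with r = 0.01 it is reduced by only 1% per generation.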


Considering an F1 hybrid between two homozygous parents as an example, 
parents P1 and P2 have genotypes AABB and aabb, the recombination frequency 
between the two loci is r, and the four alleles A, a, B, and b have an equal frequency of $\tfrac{1}{2}$. 
Theoretical frequencies of the four gametes AB, Ab, aB, and ab generated by the F1 
hybrid are equal to $\tfrac{1}{2}(1 - r)$, $\tfrac{1}{2}r$, $\tfrac{1}{2}r$, and $\tfrac{1}{2}(1 - r)$, respectively. If the production of 
gametes through meiosis is treated as the starting point of a new generation, the 
gametic-phase dis-equilibrium can be calculated by equation 3.17, which is also 
shown in equation 3.21. 


$$
D_1 = \left[\tfrac{1}{2}(1 - r)\right]^2 - \left[\tfrac{1}{2}r\right]^2 = \tfrac{1}{4}(1 - 2r) \tag{3.21}
$$

The readers can confirm that the four expected gametic frequencies previously 
mentioned are equal to $\tfrac{1}{4} + D_1$, $\tfrac{1}{4} - D_1$, $\tfrac{1}{4} - D_1$, and $\tfrac{1}{4} + D_1$, respectively. The random 
combination of the four gametes will generate the F2 population; therefore, the 
expected genotypic frequencies in F2 can be derived from the four gametic 
frequencies. F2 is sometimes treated as a special case of random mating, where the 
parental population (i.e., the F1 hybrid) has one single genotype, and the two alleles have an 
equal frequency of 0.5 at each locus. Starting from F2, dis-equilibrium after 
t generations of random mating is given in equation 3.22. 


$$
D_t = \tfrac{1}{4}(1 - 2r)(1 - r)^{t-1} \tag{3.22}
$$


From equation 3.20, after t generations of random mating, the four gametes AB, 
Ab, aB, and ab have frequencies $\tfrac{1}{4} + D_t$, $\tfrac{1}{4} - D_t$, $\tfrac{1}{4} - D_t$, and $\tfrac{1}{4} + D_t$, respectively. 
Similar to repeated selfing, an accumulated recombination frequency R can also 
be defined after t generations of random mating, and the four gametic frequencies 
can be re-written as $\tfrac{1}{2}(1 - R)$, $\tfrac{1}{2}R$, $\tfrac{1}{2}R$, and $\tfrac{1}{2}(1 - R)$, respectively. The relationship 
between R and the one-meiosis recombination frequency r is given by 
equation 3.23. When r is small, $1 - (t - 1)r$ can be used to approximate $(1 - r)^{t-1}$. 
By ignoring the squared term in r, an approximate relationship between R and r can 
be given in equation 3.24. Obviously, random mating can also accumulate 
crossing-over events in the progeny population, and therefore helps to construct 
high-resolution linkage maps and improve the accuracy of gene mapping (Frisch and 
Melchinger, 2008; Darvasi and Soller, 1995). 


$$
R = \tfrac{1}{2} - 2D_t = \tfrac{1}{2}\left[1 - (1 - 2r)(1 - r)^{t-1}\right] \tag{3.23}
$$

$$
R \approx \left[1 + \tfrac{1}{2}(t - 1)\right] r \tag{3.24}
$$
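
The quality of the approximation in equation 3.24 can be checked against the exact relationship in equation 3.23. The following is a minimal sketch for the F1-initiated case; the function names `accumulated_R` and `approx_R` are hypothetical.

```python
def accumulated_R(r, t):
    """Equation 3.23: accumulated recombination frequency after t generations."""
    return 0.5 * (1.0 - (1.0 - 2.0 * r) * (1.0 - r) ** (t - 1))

def approx_R(r, t):
    """Equation 3.24: first-order approximation, valid when r is small."""
    return (1.0 + 0.5 * (t - 1)) * r

for r in (0.01, 0.05, 0.2):
    for t in (1, 5, 10):
        print(r, t, round(accumulated_R(r, t), 5), round(approx_R(r, t), 5))
```

At t = 1 both expressions reduce to r. The two values stay close for small r and small t, and the approximation overestimates R as r or t grows.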


3.4.2 Generation Transition Matrix from Diploid 
Genotypes to Haploid Gametes 


Following the representations and symbols used for genotypes and 
their frequencies in §2.1 in chapter 2, $p^{(0)}$ represents the frequencies of the 10 genotypes in a 
population before random mating, i.e., equation 3.25. 


$$
p^{(0)} = \begin{bmatrix} p_{AABB}^{(0)} & p_{AABb}^{(0)} & p_{AAbb}^{(0)} & p_{AaBB}^{(0)} & p_{AB/ab}^{(0)} & p_{Ab/aB}^{(0)} & p_{Aabb}^{(0)} & p_{aaBB}^{(0)} & p_{aaBb}^{(0)} & p_{aabb}^{(0)} \end{bmatrix} \tag{3.25}
$$


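The four allele frequencies can be obtained from the genotypic frequencies, i.e., 
equation 3.26. 


$$
\begin{aligned}
p_A &= p_{AABB}^{(0)} + p_{AABb}^{(0)} + p_{AAbb}^{(0)} + \tfrac{1}{2}\left(p_{AaBB}^{(0)} + p_{AB/ab}^{(0)} + p_{Ab/aB}^{(0)} + p_{Aabb}^{(0)}\right), \\
p_a &= \tfrac{1}{2}\left(p_{AaBB}^{(0)} + p_{AB/ab}^{(0)} + p_{Ab/aB}^{(0)} + p_{Aabb}^{(0)}\right) + p_{aaBB}^{(0)} + p_{aaBb}^{(0)} + p_{aabb}^{(0)}, \\
p_B &= p_{AABB}^{(0)} + p_{AaBB}^{(0)} + p_{aaBB}^{(0)} + \tfrac{1}{2}\left(p_{AABb}^{(0)} + p_{AB/ab}^{(0)} + p_{Ab/aB}^{(0)} + p_{aaBb}^{(0)}\right), \\
p_b &= \tfrac{1}{2}\left(p_{AABb}^{(0)} + p_{AB/ab}^{(0)} + p_{Ab/aB}^{(0)} + p_{aaBb}^{(0)}\right) + p_{AAbb}^{(0)} + p_{Aabb}^{(0)} + p_{aabb}^{(0)}
\end{aligned} \tag{3.26}
$$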


Assuming there are no other factors affecting the structure of progeny 
populations, e.g., random drift and admixture, allele frequencies as given in equation 3.26 
remain unchanged during random mating. Summarized below is the procedure to 
derive the 10 genotypic frequencies after several generations of random mating. 
Given the frequencies of the four gametes generated by one population, the 
gametic-phase dis-equilibrium at the first generation of random mating, i.e., $D_1$, can 
first be calculated from equation 3.17. Secondly, dis-equilibrium after t generations 
of random mating, i.e., $D_t$, can be calculated from equation 3.19. Thirdly, the four 
gametic frequencies after t generations of random mating can be calculated from 
equation 3.20. Finally, the 10 genotypic frequencies can be obtained from the 
random combination of the four gametes. 

A generation transition matrix is needed to calculate the gametic frequencies for 
the first generation of random mating. The matrix is denoted by $T_{RM}$, which is given in 
equation 3.27. Based on the number of heterozygous loci, three cases will be 
considered for the 10 genotypes, i.e., no heterozygote, one-locus heterozygote, and two-locus 
heterozygote. The four gametes AB, Ab, aB, and ab are called types 1, 2, 3, and 4, 
respectively. 


(1) No heterozygote, or homozygous at both loci. Four of the 10 genotypes belong to 
this case. Each of them produces only one type of gamete, i.e., AABB produces 
type 1, AAbb produces type 2, aaBB produces type 3, and aabb produces type 4. 
Therefore, in transition matrix $T_{RM}$, the first element in row 1, the second 
element in row 3, the third element in row 8, and the fourth element in row 10 are equal 
to 1; the other elements in rows 1, 3, 8, and 10 are all equal to 0 (equation 3.27). 

(2) One-locus heterozygote and one-locus homozygote. At the heterozygous locus, 
two types of gamete will be produced with equal frequencies. Taking AABb as an 
example, type 1 and type 2 are produced, each with frequency $\tfrac{1}{2}$. Therefore, in 
the transition matrix $T_{RM}$, the first and second elements in row 2 are equal to $\tfrac{1}{2}$, 
and the other two elements in row 2 are equal to 0 (equation 3.27). Probability 
vectors for AaBB, Aabb, and aaBb can be determined in a similar way. 

(3) Double heterozygote, i.e., AB/ab and Ab/aB. Genotype AB/ab produces the four 
types of gamete with frequencies $\tfrac{1}{2}(1 - r)$, $\tfrac{1}{2}r$, $\tfrac{1}{2}r$, and $\tfrac{1}{2}(1 - r)$, respectively, 
corresponding to the four elements in row 5 in equation 3.27. Genotype Ab/aB 
produces the four types of gamete with frequencies $\tfrac{1}{2}r$, $\tfrac{1}{2}(1 - r)$, $\tfrac{1}{2}(1 - r)$, and 
$\tfrac{1}{2}r$, respectively, corresponding to the four elements in row 6 in equation 3.27. 
Obviously, parental and recombinant relationships are switched in gametes 
produced by the two types of the double heterozygote. 


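$$
T_{RM} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
\tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\
0 & 1 & 0 & 0 \\
\tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\
\tfrac{1}{2}(1 - r) & \tfrac{1}{2}r & \tfrac{1}{2}r & \tfrac{1}{2}(1 - r) \\
\tfrac{1}{2}r & \tfrac{1}{2}(1 - r) & \tfrac{1}{2}(1 - r) & \tfrac{1}{2}r \\
0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\
0 & 0 & 1 & 0 \\
0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\
0 & 0 & 0 & 1
\end{bmatrix} \tag{3.27}
$$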
Therefore, given the 10 genotypic frequencies as arranged in equation 3.25, 
frequencies of the four gametes at the first generation of random mating can be 
calculated from equation 3.28. The gametic-phase dis-equilibrium at the first 
generation of random mating is given by equation 3.29. 


$$
\begin{bmatrix} p_{AB}^{(1)} & p_{Ab}^{(1)} & p_{aB}^{(1)} & p_{ab}^{(1)} \end{bmatrix} = p^{(0)} T_{RM} \tag{3.28}
$$

$$
D_1 = p_{AB}^{(1)} - p_A p_B \tag{3.29}
$$
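
Equations 3.27 to 3.29 translate directly into a small matrix calculation. The following is a minimal sketch in plain Python; the function names `transition_matrix_rm` and `first_generation` are hypothetical, and the genotypes are assumed to follow the order of equation 3.25 (AABB, AABb, AAbb, AaBB, AB/ab, Ab/aB, Aabb, aaBB, aaBb, aabb).

```python
def transition_matrix_rm(r):
    """Rows: 10 genotypes in the order of equation 3.25; columns: AB, Ab, aB, ab."""
    h = 0.5
    return [
        [1, 0, 0, 0],                                  # AABB
        [h, h, 0, 0],                                  # AABb
        [0, 1, 0, 0],                                  # AAbb
        [h, 0, h, 0],                                  # AaBB
        [h * (1 - r), h * r, h * r, h * (1 - r)],      # AB/ab
        [h * r, h * (1 - r), h * (1 - r), h * r],      # Ab/aB
        [0, h, 0, h],                                  # Aabb
        [0, 0, 1, 0],                                  # aaBB
        [0, 0, h, h],                                  # aaBb
        [0, 0, 0, 1],                                  # aabb
    ]

def first_generation(p0, r):
    """Gametic frequencies (equation 3.28) and D_1 (equation 3.29)."""
    T = transition_matrix_rm(r)
    gametes = [sum(p0[i] * T[i][j] for i in range(10)) for j in range(4)]
    p_A = gametes[0] + gametes[1]
    p_B = gametes[0] + gametes[2]
    D1 = gametes[0] - p_A * p_B
    return gametes, D1

# Example: the F1 hybrid AB/ab as the only genotype in the starting population
p0 = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
gametes, D1 = first_generation(p0, r=0.1)
print(gametes, D1)
```

For the F1 hybrid with r = 0.1, the value of D_1 obtained this way agrees with equation 3.21, i.e., (1 - 2r)/4 = 0.2.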


3.4.3 Gametic and Genotypic Frequencies in Populations 
After Several Generations of Random Mating 


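The population before random mating is represented by t = 0, and its genotypic 
frequencies are given by equation 3.25. Equation 3.29 gives the dis-equilibrium at the 
first generation of random mating, i.e., $D_1$. Equation 3.19 gives the dis-equilibrium 
after t (t > 0) generations of random mating, i.e., $D_t$. Therefore, after t generations 
of random mating, the four gametic frequencies can be calculated by equation 3.30, 
and the 10 genotypic frequencies can be calculated by equation 3.31. 


$$
\begin{aligned}
p_{AB}^{(t)} &= p_A p_B + D_t, & p_{Ab}^{(t)} &= p_A p_b - D_t, \\
p_{aB}^{(t)} &= p_a p_B - D_t, & p_{ab}^{(t)} &= p_a p_b + D_t
\end{aligned} \tag{3.30}
$$

$$
\begin{aligned}
p_{AABB}^{(t)} &= \left[p_{AB}^{(t)}\right]^2, & p_{AABb}^{(t)} &= 2 p_{AB}^{(t)} p_{Ab}^{(t)}, & p_{AAbb}^{(t)} &= \left[p_{Ab}^{(t)}\right]^2, \\
p_{AaBB}^{(t)} &= 2 p_{AB}^{(t)} p_{aB}^{(t)}, & p_{AB/ab}^{(t)} &= 2 p_{AB}^{(t)} p_{ab}^{(t)}, & p_{Ab/aB}^{(t)} &= 2 p_{Ab}^{(t)} p_{aB}^{(t)}, \\
p_{Aabb}^{(t)} &= 2 p_{Ab}^{(t)} p_{ab}^{(t)}, & p_{aaBB}^{(t)} &= \left[p_{aB}^{(t)}\right]^2, & p_{aaBb}^{(t)} &= 2 p_{aB}^{(t)} p_{ab}^{(t)}, \\
p_{aabb}^{(t)} &= \left[p_{ab}^{(t)}\right]^2
\end{aligned} \tag{3.31}
$$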


Taking the F1 (t = 0) as the starting population, the 10 genotypic frequencies are 
given by, 


$$
p^{(0)} = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}
$$


Allele frequencies are given by, 


$$
p_A = p_a = p_B = p_b = \tfrac{1}{2}
$$

Dis-equilibrium after t generations of random mating can be found in 
equation 3.22. Therefore, the four gametic frequencies after t generations of random 
mating are given by, 


$$
\begin{aligned}
p_{AB}^{(t)} &= \tfrac{1}{4} + \tfrac{1}{4}(1 - 2r)(1 - r)^{t-1}, & p_{Ab}^{(t)} &= \tfrac{1}{4} - \tfrac{1}{4}(1 - 2r)(1 - r)^{t-1}, \\
p_{aB}^{(t)} &= \tfrac{1}{4} - \tfrac{1}{4}(1 - 2r)(1 - r)^{t-1}, & p_{ab}^{(t)} &= \tfrac{1}{4} + \tfrac{1}{4}(1 - 2r)(1 - r)^{t-1}
\end{aligned} \tag{3.32}
$$

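The 10 genotypic frequencies after t generations of random mating are given by, 


$$
\begin{aligned}
p_{AABB}^{(t)} &= \tfrac{1}{16}\left[1 + (1 - 2r)(1 - r)^{t-1}\right]^2, & p_{AABb}^{(t)} &= \tfrac{1}{8}\left[1 - (1 - 2r)^2 (1 - r)^{2t-2}\right], \\
p_{AAbb}^{(t)} &= \tfrac{1}{16}\left[1 - (1 - 2r)(1 - r)^{t-1}\right]^2, & p_{AaBB}^{(t)} &= \tfrac{1}{8}\left[1 - (1 - 2r)^2 (1 - r)^{2t-2}\right], \\
p_{AB/ab}^{(t)} &= \tfrac{1}{8}\left[1 + (1 - 2r)(1 - r)^{t-1}\right]^2, & p_{Ab/aB}^{(t)} &= \tfrac{1}{8}\left[1 - (1 - 2r)(1 - r)^{t-1}\right]^2, \\
p_{Aabb}^{(t)} &= \tfrac{1}{8}\left[1 - (1 - 2r)^2 (1 - r)^{2t-2}\right], & p_{aaBB}^{(t)} &= \tfrac{1}{16}\left[1 - (1 - 2r)(1 - r)^{t-1}\right]^2, \\
p_{aaBb}^{(t)} &= \tfrac{1}{8}\left[1 - (1 - 2r)^2 (1 - r)^{2t-2}\right], & p_{aabb}^{(t)} &= \tfrac{1}{16}\left[1 + (1 - 2r)(1 - r)^{t-1}\right]^2
\end{aligned} \tag{3.33}
$$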
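
The closed forms in equation 3.33 follow from the random union of the gametes in equation 3.32, which can be confirmed numerically. The sketch below is only a check under that reading of the equations; the function names `gametes_f1_start` and `genotypes_from_gametes` are hypothetical, and the genotype order is assumed to follow equation 3.25.

```python
def gametes_f1_start(r, t):
    """Equation 3.32: frequencies of AB, Ab, aB, ab after t generations (t >= 1)."""
    Dt = 0.25 * (1 - 2 * r) * (1 - r) ** (t - 1)   # equation 3.22
    return [0.25 + Dt, 0.25 - Dt, 0.25 - Dt, 0.25 + Dt]

def genotypes_from_gametes(g):
    """Equation 3.31: random union of gametes, genotypes ordered as in equation 3.25."""
    AB, Ab, aB, ab = g
    return [AB * AB, 2 * AB * Ab, Ab * Ab, 2 * AB * aB, 2 * AB * ab,
            2 * Ab * aB, 2 * Ab * ab, aB * aB, 2 * aB * ab, ab * ab]

r, t = 0.1, 3
p_t = genotypes_from_gametes(gametes_f1_start(r, t))

# Closed forms from equation 3.33 for comparison
x = (1 - 2 * r) * (1 - r) ** (t - 1)
closed_AABB = (1 + x) ** 2 / 16
closed_AABb = (1 - x * x) / 8
closed_ABab = (1 + x) ** 2 / 8
print(p_t[0], closed_AABB, p_t[1], closed_AABb, p_t[4], closed_ABab)
```

Setting t = 1 in the same functions gives the familiar F2 frequencies, for example (1 - r)^2/4 for AABB.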


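If possible, doubled haploid technology can then be applied to generate the DH 
population. Multiplication of the vector of genotypic frequencies (i.e., 
equation 3.33) by the transition matrix for doubled haploids (i.e., equation 2.5 in 
chapter 2) will give the expected frequencies of the four homozygous genotypes in the 
finally developed DH population, i.e., equation 3.34. 


$$
\begin{aligned}
p_{AABB}^{(t)\text{-DH}} &= \tfrac{1}{4}\left[1 + (1 - 2r)(1 - r)^{t}\right], & p_{AAbb}^{(t)\text{-DH}} &= \tfrac{1}{4}\left[1 - (1 - 2r)(1 - r)^{t}\right], \\
p_{aaBB}^{(t)\text{-DH}} &= \tfrac{1}{4}\left[1 - (1 - 2r)(1 - r)^{t}\right], & p_{aabb}^{(t)\text{-DH}} &= \tfrac{1}{4}\left[1 + (1 - 2r)(1 - r)^{t}\right]
\end{aligned} \tag{3.34}
$$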


If wanted, repeated selfing can also be applied to generate the RIL population. 
Multiplication of the vector of genotypic frequencies (i.e., equation 3.33) by 
the transition matrix for repeated selfing (i.e., equation 2.6 in chapter 2) will give 
the expected frequencies of the four homozygous genotypes in the finally developed RIL 
population, i.e., equation 3.35. The readers are encouraged to choose one expected frequency 
shown in equation 3.34 or equation 3.35 for confirmation as an exercise. 


$$
\begin{aligned}
p_{AABB}^{(t)\text{-RIL}} &= \tfrac{1}{4}\left[1 + \frac{(1 - 2r)(1 - r)^{t-1}}{1 + 2r}\right], & p_{AAbb}^{(t)\text{-RIL}} &= \tfrac{1}{4}\left[1 - \frac{(1 - 2r)(1 - r)^{t-1}}{1 + 2r}\right], \\
p_{aaBB}^{(t)\text{-RIL}} &= \tfrac{1}{4}\left[1 - \frac{(1 - 2r)(1 - r)^{t-1}}{1 + 2r}\right], & p_{aabb}^{(t)\text{-RIL}} &= \tfrac{1}{4}\left[1 + \frac{(1 - 2r)(1 - r)^{t-1}}{1 + 2r}\right]
\end{aligned} \tag{3.35}
$$
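
The confirmation suggested above can also be done numerically. The sketch below does not use the transition matrices of chapter 2, which are not reproduced here. Instead, it assumes that a DH line is a doubled gamete (so the DH frequencies are the gametic output of the population) and that repeated selfing can be approximated by iterating one generation of selfing many times, with equal recombination frequency in female and male meiosis. All function names are hypothetical.

```python
def t_rm(r):
    """Genotype-to-gamete matrix of equation 3.27 (rows ordered as in equation 3.25)."""
    h = 0.5
    return [[1, 0, 0, 0], [h, h, 0, 0], [0, 1, 0, 0], [h, 0, h, 0],
            [h * (1 - r), h * r, h * r, h * (1 - r)],
            [h * r, h * (1 - r), h * (1 - r), h * r],
            [0, h, 0, h], [0, 0, 1, 0], [0, 0, h, h], [0, 0, 0, 1]]

def union(g):
    """Random union of gametes (AB, Ab, aB, ab) into the 10 genotypes."""
    AB, Ab, aB, ab = g
    return [AB * AB, 2 * AB * Ab, Ab * Ab, 2 * AB * aB, 2 * AB * ab,
            2 * Ab * aB, 2 * Ab * ab, aB * aB, 2 * aB * ab, ab * ab]

def vec_mat(p, M):
    """Row vector times matrix."""
    return [sum(p[i] * M[i][j] for i in range(len(p))) for j in range(len(M[0]))]

def population(r, t):
    """Genotypic frequencies after t generations of random mating from F1."""
    Dt = 0.25 * (1 - 2 * r) * (1 - r) ** (t - 1)
    return union([0.25 + Dt, 0.25 - Dt, 0.25 - Dt, 0.25 + Dt])

def dh_from(p, r):
    """DH lines: doubled gametes, ordered AABB, AAbb, aaBB, aabb."""
    return vec_mat(p, t_rm(r))

def ril_from(p, r, generations=60):
    """Approximate RILs by iterating selfing; each row of S is one selfed genotype."""
    S = [union(row) for row in t_rm(r)]
    for _ in range(generations):
        p = vec_mat(p, S)
    return [p[0], p[2], p[7], p[9]]

r, t = 0.1, 3
x = (1 - 2 * r) * (1 - r) ** (t - 1)
dh = dh_from(population(r, t), r)
ril = ril_from(population(r, t), r)
print(dh[0], 0.25 * (1 + x * (1 - r)))        # compare with equation 3.34
print(ril[0], 0.25 * (1 + x / (1 + 2 * r)))   # compare with equation 3.35
```

After 60 generations of selfing the residual heterozygosity is negligible, so the iterated values can be compared directly with the closed forms.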


Starting from the F1 population made by crossing two pure-line parents, one progeny 
population can be developed after several generations of random mating. Once 
genotyping is done in the progeny population, pair-wise recombination frequencies 
can be estimated from the theoretical frequencies given in equation 3.33. When 
doubled haploid technology is applied to the progeny population and genotyping is 
done with the developed DH lines, recombination frequencies can be estimated from 
the theoretical frequencies given in equation 3.34. When repeated selfing is applied to 
the progeny population and genotyping is done with the developed RILs, 
recombination frequencies can be estimated from the theoretical frequencies given in 
equation 3.35. Therefore, linkage maps can also be constructed in those populations. 

In random mating populations, attention should be paid to how the 
generation is counted. Applying the transition matrix $T_{RM}$ (equation 3.27) to the 
genotypic frequencies given in equation 3.25 is actually equivalent to the production of 
randomized gametes to make the first generation of random mating, which should 
be treated as the start of a new generation. The readers may have realized that when 
t = 1, the frequencies of the four homozygotes are not equal to those in the 
F1-derived DH population. After one generation of random mating (equivalent to 
one generation of selfing as far as F1 is concerned), the F2 population is developed 
and then used to generate the DH population. Therefore, when t = 1, the DH 
population given in equation 3.34 comes from the doubling of gametes produced by 
the F2 individuals, i.e., the DH population is F2-derived. When t = 1, the RIL 
population given in equation 3.35 comes from the repeated selfing of the F2 
individuals, which has the same frequencies as repeated selfing starting from the F1 hybrid. 
This is again due to the equivalence between random mating and selfing with regard 
to the F1 population. When t = 2, the RIL population given in equation 3.35 is no 
longer equivalent to the F1-derived RIL population. One more generation of random 
mating before the repeated selfing reduces the dis-equilibrium by a factor of (1 - r), resulting in 
a larger accumulated recombination frequency and different genotypic frequencies 
from the F1-derived RIL population. 


Exercises 


3.1 Assume three loci A, B, and C are located at 0, 8, and 12.5 cM on one 
chromosome, respectively. Use the Haldane mapping function to calculate the pair-wise 
recombination frequencies $r_{AB}$, $r_{BC}$, and $r_{AC}$, and confirm the relationship 
$r_{AC} = r_{AB} + r_{BC} - 2 r_{AB} r_{BC}$. 


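3.2 Show that 
(1) When $\delta = 0$, equation 3.4 is equivalent to the equation $1 - 2r_{13} = (1 - 2r_{12})(1 - 2r_{23})$. 
(2) When $\delta = 1 - 2r_{13}$, equation 3.4 is equivalent to the equation $\frac{1 - 2r_{13}}{1 + 2r_{13}} = \frac{1 - 2r_{12}}{1 + 2r_{12}} \times \frac{1 - 2r_{23}}{1 + 2r_{23}}$. 


(Hints: Let $\delta = 1 - 2r_{13}$ in equation 3.4. It can be shown that $r_{13} = \frac{r_{12} + r_{23}}{1 + 4 r_{12} r_{23}}$, based 
on which it can be re-arranged accordingly.) 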


3.3 Assume there are 21 markers evenly distributed on one chromosome with a 
marker interval of 5 cM. The QTL IciMapping software is able to simulate bi-parental 
populations, given the marker and gene information. In one simulated F2 of size 
200, all markers are co-dominant and no marker types are missing in genotyping. 
In another simulated F2 of size 110, co-dominant, dominant, and recessive 
markers are present, and some marker types are missing (i.e., table 1.8 in chapter 1). 
Interference was not considered in the simulation, i.e., the coefficient of interference 
was equal to 0. Estimated recombination frequencies between markers i and i - 1, 
and between markers i and i - 2, are given below, where markers are arranged by 
their order on the chromosome. Work out the estimated coefficient of interference 
between the two intervals determined by markers i, i + 1, and i + 2, where i = 1 to 19. 


         F2 population with 200 individuals    F2 population with 110 individuals
Order    Markers           Markers             Markers           Markers
         i and i - 1       i and i - 2         i and i - 1       i and i - 2
1
2        0.0435                                0.0381
3        0.0488            0.0892              0.0090            0.0616
4        0.0514            0.0920              0.0365            0.0785
5        0.0540            0.1030              0.0381            0.0730
6        0.0566            0.0917              0.0751            0.1062
7        0.0513            0.0946              0.0278            0.0820
8        0.0435            0.0862              0.0384            0.0670
9        0.0357            0.0755              0.0332            0.0615
10       0.0383            0.0755              0.0279            0.0632
11       0.0435            0.0837              0.0474            0.0763
12       0.0646            0.1114              0.0518            0.1014
13       0.0754            0.1281              0.0664            0.1126
14       0.0486            0.1054              0.0483            0.0980
15       0.0754            0.1228              0.0847            0.1251
16       0.0621            0.1255              0.0635            0.1327
17       0.0435            0.0756              0.0389            0.0902
18       0.0646            0.1059              0.0640            0.0997
19       0.0330            0.0946              0.0188            0.1076
20       0.0408            0.0645              0.0297            0.0588
21       0.0434            0.0808              0.0388            0.0933


3.4 Take the barley DH population in the QTL IciMapping software to practice the 
MAP functionality for genetic linkage map construction. 


(1) Construct linkage maps for the seven chromosomes in barley. 
(2) Output the constructed linkage maps. 
(3) Try to split the longest map into two fragments from the largest interval. 


3.5 IBM is a genetic population consisting of recombinant inbred lines in maize. In 
the hybridization to make the F1 hybrid, inbred B73 was used as the female parent and 
inbred Mo17 was used as the male parent. Starting from F2, four generations of random 
mating were conducted, followed by repeated selfing until the new 
generation of recombinant inbred lines was developed (Lee et al., 2002). Assume 
the recombination frequency between two loci is 0.05, and alleles at the loci are denoted 
by A and B in parent B73 and by a and b in parent Mo17. 


(1) Work out the haploid types of gametes produced by F1, their expected 
frequencies, and the linkage dis-equilibrium between the two loci. 

(2) Work out the linkage dis-equilibrium, expected gametic frequencies, and the 
accumulated recombination frequency after four generations of random mating 
since F2. 

(3) Work out the expected frequencies of the 10 genotypes in the progeny population 
after four generations of random mating since F2. 

(4) Work out the transition matrix from the 10 genotypes at the two loci to their 
four homozygous progenies through repeated selfing. Use the transition matrix 
to work out the expected frequencies of the four homozygous genotypes developed 
by four generations of random mating since F2 followed by repeated selfing, 
a population that is similar to IBM in maize. 


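3.6 Expected frequencies of the 10 genotypes at two loci in P1BC1F1 are given by, 


$$
\begin{aligned}
p^{(0)} &= \begin{bmatrix} p_{AABB}^{(0)} & p_{AABb}^{(0)} & p_{AAbb}^{(0)} & p_{AaBB}^{(0)} & p_{AB/ab}^{(0)} & p_{Ab/aB}^{(0)} & p_{Aabb}^{(0)} & p_{aaBB}^{(0)} & p_{aaBb}^{(0)} & p_{aabb}^{(0)} \end{bmatrix} \\
&= \begin{bmatrix} \tfrac{1}{2}(1 - r) & \tfrac{1}{2}r & 0 & \tfrac{1}{2}r & \tfrac{1}{2}(1 - r) & 0 & 0 & 0 & 0 & 0 \end{bmatrix}
\end{aligned}
$$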


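(1) Show that the allelic frequencies are $p_A = \tfrac{3}{4}$, $p_a = \tfrac{1}{4}$, $p_B = \tfrac{3}{4}$, and $p_b = \tfrac{1}{4}$. 
(2) Show that the four gametes at the first generation of random mating have the 
following expected frequencies. 


$$
\begin{aligned}
p_{AB}^{(1)} &= \tfrac{1}{2} + \tfrac{1}{4}(1 - r)^2, & p_{Ab}^{(1)} &= \tfrac{1}{4} - \tfrac{1}{4}(1 - r)^2, \\
p_{aB}^{(1)} &= \tfrac{1}{4} - \tfrac{1}{4}(1 - r)^2, & p_{ab}^{(1)} &= \tfrac{1}{4}(1 - r)^2
\end{aligned}
$$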


(3) Show that the gametic-phase dis-equilibrium at the first generation of random 
mating is given by, 

$$
D_1 = \tfrac{1}{16}(3 - 8r + 4r^2)
$$

(4) After t + 1 (t ≥ 0) generations of random mating, show that the gametic-phase 
dis-equilibrium $D_{t+1}$ is given by, 


$$
D_{t+1} = \tfrac{1}{16}(3 - 8r + 4r^2)(1 - r)^{t}
$$
(5) After t + 1 (t ≥ 0) generations of random mating, show that the four gametic 
frequencies are given by, 


$$
\begin{aligned}
p_{AB}^{(t+1)} &= \tfrac{9}{16} + D_{t+1}, & p_{Ab}^{(t+1)} &= \tfrac{3}{16} - D_{t+1}, \\
p_{aB}^{(t+1)} &= \tfrac{3}{16} - D_{t+1}, & p_{ab}^{(t+1)} &= \tfrac{1}{16} + D_{t+1}
\end{aligned}
$$


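(6) After t + 1 (t ≥ 0) generations of random mating, show that the 10 genotypic 
frequencies are given by, 


$$
\begin{aligned}
p_{AABB}^{(t+1)} &= \left(\tfrac{9}{16} + D_{t+1}\right)^2, & p_{AABb}^{(t+1)} &= 2\left(\tfrac{9}{16} + D_{t+1}\right)\left(\tfrac{3}{16} - D_{t+1}\right), \\
p_{AAbb}^{(t+1)} &= \left(\tfrac{3}{16} - D_{t+1}\right)^2, & p_{AaBB}^{(t+1)} &= 2\left(\tfrac{9}{16} + D_{t+1}\right)\left(\tfrac{3}{16} - D_{t+1}\right), \\
p_{AB/ab}^{(t+1)} &= 2\left(\tfrac{9}{16} + D_{t+1}\right)\left(\tfrac{1}{16} + D_{t+1}\right), & p_{Ab/aB}^{(t+1)} &= 2\left(\tfrac{3}{16} - D_{t+1}\right)^2, \\
p_{Aabb}^{(t+1)} &= 2\left(\tfrac{3}{16} - D_{t+1}\right)\left(\tfrac{1}{16} + D_{t+1}\right), & p_{aaBB}^{(t+1)} &= \left(\tfrac{3}{16} - D_{t+1}\right)^2, \\
p_{aaBb}^{(t+1)} &= 2\left(\tfrac{3}{16} - D_{t+1}\right)\left(\tfrac{1}{16} + D_{t+1}\right), & p_{aabb}^{(t+1)} &= \left(\tfrac{1}{16} + D_{t+1}\right)^2
\end{aligned}
$$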


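(7) After t + 1 (t ≥ 0) generations of random mating, doubled haploid technology is 
applied. Show that the frequencies of the four homozygous genotypes are given by, 


$$
\begin{aligned}
p_{AABB} &= p_A p_B + (1 - r) D_{t+1} = \tfrac{9}{16} + \tfrac{1}{16}(3 - 8r + 4r^2)(1 - r)^{t+1}, \\
p_{AAbb} &= p_A p_b - (1 - r) D_{t+1} = \tfrac{3}{16} - \tfrac{1}{16}(3 - 8r + 4r^2)(1 - r)^{t+1}, \\
p_{aaBB} &= p_a p_B - (1 - r) D_{t+1} = \tfrac{3}{16} - \tfrac{1}{16}(3 - 8r + 4r^2)(1 - r)^{t+1}, \\
p_{aabb} &= p_a p_b + (1 - r) D_{t+1} = \tfrac{1}{16} + \tfrac{1}{16}(3 - 8r + 4r^2)(1 - r)^{t+1}
\end{aligned}
$$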
(8) After t + 1 (t ≥ 0) generations of random mating, repeated selfing is applied. 
Show that the frequencies of the four homozygous genotypes are given by, 

$$
\begin{aligned}
p_{AABB} &= p_A p_B + (1 - R) D_{t+1} = \tfrac{9}{16} + \frac{3 - 8r + 4r^2}{16(1 + 2r)}(1 - r)^{t}, \\
p_{AAbb} &= p_A p_b - (1 - R) D_{t+1} = \tfrac{3}{16} - \frac{3 - 8r + 4r^2}{16(1 + 2r)}(1 - r)^{t}, \\
p_{aaBB} &= p_a p_B - (1 - R) D_{t+1} = \tfrac{3}{16} - \frac{3 - 8r + 4r^2}{16(1 + 2r)}(1 - r)^{t}, \\
p_{aabb} &= p_a p_b + (1 - R) D_{t+1} = \tfrac{1}{16} + \frac{3 - 8r + 4r^2}{16(1 + 2r)}(1 - r)^{t}
\end{aligned}
$$


Chapter 4 


Single Marker Analysis and Simple 
Interval Mapping 


Phenotypes of most biological traits are caused by the combined action of 
genetic and environmental factors. Generally speaking, genetic factors are 
substances in the genome, namely genes, that can be transmitted from the parental 
generation to the progeny generation during propagation. Genetic factors other than 
the nuclear genes, such as cytoplasmic factors and epigenetics, are not considered in 
this book. Different traits are controlled by different numbers of genes. Although 
some traits are controlled by only one or a few genes, most traits are 
controlled by multiple genes. Genes may interact with each other and may also interact 
with environments. Quantitative trait locus (QTL) mapping was originally 
developed for mapping complex and quantitative traits which are controlled by 
multiple genes, but the methodology is also suitable for genetic mapping of traits 
controlled by a few genes or Mendelian traits controlled by single genes. Previous 
chapters focused on genetic populations, linkage analysis, and the construction 
of genetic linkage maps based on polymorphic markers. Linkage maps allow 
geneticists to distinguish not only different chromosomes in the genome but also 
specific positions on one chromosome, such as long arm, short arm, centromere, and 
telomere. 

Therefore, if the genes for biological traits can be located by identifiable markers 
or located within specific marker intervals, they are actually mapped to particular 
positions on specific chromosomes. The process of locating individual QTLs on 
chromosomes and estimating their genetic effects on phenotypic traits is called 
QTL mapping. Since the interval mapping (IM) method was proposed in 1989 
(Lander and Botstein, 1989), QTL mapping has gradually become the major task in 
quantitative genetic studies (Zhang et al., 2008, 2010, 2012, 2015b, 2017; Thomas, 
2010; Li et al., 2007, 2008; Holland, 2007; Carlborg et al., 2003; Barton and 
Keightley, 2002; Broman and Speed, 2002; Kao et al., 1999; Zeng et al., 1999; Xu, 
1998; Doerge and Rebai, 1996; Tanksley and Nelson, 1996; Whittaker et al., 1996; 
Zeng, 1994; Haley and Knott, 1992). Successes have been reported in 
map-based cloning of quantitative trait genes based on QTL mapping results, 
and in indirect selection on phenotypic traits based on closely linked markers 
(Wan et al., 2005, 2006, 2008; Remington et al., 2001; Frary et al., 2000). Introduced 
in this chapter are single marker analysis and simple interval mapping. The terms simple 
interval mapping and interval mapping are used interchangeably in this book, in 
contrast with the more advanced methods in which the background genetic 
variation is controlled to improve the detection power of individual QTLs, to be 
introduced in the following chapters. 


4.1 Single Marker Analysis 


QTL mapping is built on the difference in phenotypic means observed in different 
groups or classes of marker genotypes in a genetically segregating population. The 
first such study on the association between phenotypic traits and identifiable 
markers was published by Sax (1923), where the association between seed color and 
seed weight was reported in beans (Phaseolus vulgaris). Table 4.1 shows the average 
seed weights of different color groups in two F2 populations together with their four 
parents, modified from table 2 in Sax (1923). Seeds of female parents IYE1310 and 
IYE1317 are uniformly self-colored with yellow pigment around the hilum and 
covering 1/4-1/3 of the surface of the seed, whereas the seeds of male parents W1333 
and W1228 are white. The seed weights of female parents are about twice as much as 
the weights of male parents. In the two F2 populations, seed color is divided into four 
classes (table 4.1). In both F2 populations, the average weight of the colored seeds 
is significantly higher than that of the white seeds, and therefore seed weight and 
seed color are associated. The use of individual genetic markers in linkage analysis 
on quantitative traits is called single marker analysis, which is suitable for 
situations where only a few markers are available and a complete genetic 
linkage map cannot be constructed. 


4.1.1 Phenotypic Means of Different Genotypes at One 
Marker Locus 


If one marker locus (two alleles denoted by M and m) is linked with one QTL that 
controls a phenotypic trait (two alleles denoted by Q and q), the identifiable marker 
types (i.e., MM, Mm, and mm) as groups will have different frequencies for the three 
QTL genotypes (i.e., QQ, Qq, and qq). When QTL genotypes perform differently, 
the three marker genotypes will follow different distributions by phenotypic obser- 
vations. If the marker locus is not linked with any QTL, QTL genotypes will have 
the same frequencies in the three marker classes, and therefore the three marker 
types follow a similar phenotypic distribution. Figure 4.1A and B show the observed 
phenotypic distributions of three identifiable marker genotypes in the case of linkage 
and no linkage, respectively. In figure 4.1A, random error cannot fully account for 
the observed difference in the three distributions, indicating that the marker is 
linked with QTL. In figure 4.1B, random error can fully account for the observed 
difference, indicating that the marker is not linked with any QTL. Therefore, testing 
whether the different marker genotypes have the same distribution or not can 
determine whether the marker is linked with QTL. In the following sections, the 
t-test will be introduced first, followed by ANOVA and the likelihood ratio test. 


TAB. 4.1 – Average seed weight (unit: ×10^-2 g) in four parents and different classes of seed color in two F2 populations, modified from table 2 in 
Sax (1923). 

Female parent  Seed weight  Male parent  Seed weight  Class of seed color in F2 population (sample size in parentheses)
                                                      Mottled           Self             Eyed             White
IYE1310        56 ± 0.5     W1333        28 ± 0.9     39.1 ± 0.4 (150)  36.5 ± 0.5 (51)  39.0 ± 0.4 (68)  33.8 ± 0.4 (80)
IYE1317        48 ± 0.5     W1228        21 ± 0.2     28.8 ± 0.4 (82)   28.6 ± 0.6 (44)  31.3 ± 1.1 (12)  26.4 ± 0.5 (41)


FIG. 4.1 – Distribution of phenotypic values for three genotypes (MM, Mm, and mm) at one marker locus. Notes: 
(A) The marker is linked with QTL affecting the trait; (B) The marker is not linked with any QTL 
affecting the trait. 

Take the doubled haploid (DH) population as an example to illustrate the
principle of single marker analysis in QTL mapping. Assume the genotypes of the two
parents are MMQQ (P1) and mmqq (P2). In the DH population, the two marker types
are MM and mm, and the two QTL genotypes are QQ and qq. There are four genotypic
combinations when the marker and QTL are considered jointly, i.e., MMQQ, MMqq,
mmQQ, and mmqq. Assuming the recombination frequency between the marker and
QTL is r, frequencies of the four genotypes are functions of r, i.e., (1 - r)/2, r/2, r/2,
and (1 - r)/2, respectively. Under the one-locus additive and dominant model,
phenotypic means ($\mu$) of the four genotypes are represented by equation 4.1.

$$\mu_{MMQQ} = \mu_{mmQQ} = \mu + a, \qquad \mu_{MMqq} = \mu_{mmqq} = \mu - a \tag{4.1}$$


The phenotypic means of the two QTL genotypes differ, but sometimes
only slightly. It is generally impossible to tell the QTL genotype of each DH line
from its phenotype. However, the marker type of each DH line can be clearly 
determined from genotypes MM and mm. If the marker type of one DH line is MM, 
it is unclear whether its QTL genotype is QQ or qq. However, if the DH lines with 
marker type MM are regarded as one group, frequencies of QQ and qq in this group 
can be determined; that is, the frequency of QQ is 1 — r, and the frequency of qq is 
r (table 4.2). Therefore, the phenotypic mean of marker type MM can be given in 
equation 4.2. 


$$\mu_{MM} = (1 - r)\mu_{MMQQ} + r\mu_{MMqq} = (1 - r)(\mu + a) + r(\mu - a) = \mu + (1 - 2r)a \tag{4.2}$$


For the other group of DH lines with marker type mm, the frequency of QQ is 
r and the frequency of qq is 1 — r. Therefore, the phenotypic mean of marker type 
mm can be given in equation 4.3. 


$$\mu_{mm} = r\mu_{mmQQ} + (1 - r)\mu_{mmqq} = r(\mu + a) + (1 - r)(\mu - a) = \mu - (1 - 2r)a \tag{4.3}$$

It can be seen from equations 4.2 and 4.3 that if the marker is linked with a
quantitative trait gene, phenotypic means of the two marker types, i.e., $\mu_{MM}$ and $\mu_{mm}$,
would be different (table 4.2), and the difference is given in equation 4.4.

$$\mu_{MM} - \mu_{mm} = 2(1 - 2r)a \tag{4.4}$$


TAB. 4.2 — Frequencies of QTL genotypes and phenotypic means for two identifiable marker
types in DH population.


Marker  Frequency of QTL genotype                    Phenotypic mean
type    QQ (genotypic        qq (genotypic
        value = μ + a)       value = μ - a)
MM      1 - r                r                       μ_MM = μ + (1 - 2r)a
mm      r                    1 - r                   μ_mm = μ - (1 - 2r)a


As can be seen from equation 4.4, there would be no difference between marker
types MM and mm when there is no linkage between the marker and QTL, i.e., r = 0.5.
Based on this principle, by calculating the estimates
of $\mu_{MM}$ and $\mu_{mm}$, together with their variances, a t-statistic can be built to test the
significance of the difference between the two marker types. A significant difference
indicates linkage between the marker and QTL; otherwise, the marker cannot be
declared to be linked with QTL. In the case of multiple QTLs, the difference between the two marker
types is represented by equation 4.5.

$$\mu_{MM} - \mu_{mm} = 2\sum_{k} (1 - 2r_k) a_k \tag{4.5}$$

where $r_k$ is the recombination frequency between the kth QTL and the marker, and $a_k$ is
the additive genetic effect of the kth QTL. As can be seen from equation 4.5, if the
marker is not linked with any QTL, no difference would be observed between the two
marker types. If the marker is linked with multiple QTLs, both positive and negative
effects can be present and may cancel each other. In this situation, no difference may be observed either, which
is a shortcoming of the single marker analysis method.
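
The cancellation described above can be checked numerically. The following is a minimal sketch, assuming purely additive QTLs in a DH population as in equations 4.4 and 4.5; the function names `dh_marker_means` and `marker_difference` and the numerical values are illustrative only and do not come from the barley data.

```python
def dh_marker_means(mu, qtls):
    """Expected phenotypic means of marker types MM and mm in a DH population.

    mu   : overall mean of the trait
    qtls : list of (r, a) pairs, where r is the recombination frequency between
           the marker and one QTL, and a is the additive effect of that QTL
    """
    # Each QTL shifts the MM class by +(1 - 2r)a and the mm class by -(1 - 2r)a.
    shift = sum((1.0 - 2.0 * r) * a for r, a in qtls)
    return mu + shift, mu - shift


def marker_difference(qtls):
    """Expected difference between the MM and mm means (equations 4.4 and 4.5)."""
    mean_MM, mean_mm = dh_marker_means(0.0, qtls)
    return mean_MM - mean_mm


if __name__ == "__main__":
    print(marker_difference([(0.1, 2.0)]))               # one linked QTL
    print(marker_difference([(0.5, 2.0)]))               # one unlinked QTL
    print(marker_difference([(0.1, 2.0), (0.1, -2.0)]))  # two linked QTLs, opposite effects
```

In the third call the two linked QTLs have equal and opposite effects, so the expected difference between the marker classes vanishes even though both QTLs are linked with the marker.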


4.1.2 Single Marker Analysis by t-Test in Populations 
with Two Genotypes 


For single marker analysis, the mapping population is first divided into
several sub-populations (or classes or groups) based on genotypes at the marker
locus. If there is no linkage between the marker and QTL, these sub-populations
should in theory have the same mean and the same variance. If there is linkage
between the marker and QTL, different means (and sometimes different variances) would
be observed. In the DH population, for example, $\hat{\mu}_{MM}$ and $\hat{\mu}_{mm}$ are the estimates of
phenotypic means, and $S^2_{MM}$ and $S^2_{mm}$ are the estimates of phenotypic variances in
the two marker groups. Sample sizes of marker types MM and mm are denoted by
$n_{MM}$ and $n_{mm}$, respectively, and the degrees of freedom of the sample variances are denoted by
$df_{MM}$ and $df_{mm}$, respectively, which are equal to the corresponding sample sizes
minus one. Assuming the two marker groups have the same variance, sample variances
$S^2_{MM}$ and $S^2_{mm}$ can be combined to estimate the common variance, as shown in
equation 4.6.

$$s^2 = \frac{df_{MM} \times S^2_{MM} + df_{mm} \times S^2_{mm}}{df_{MM} + df_{mm}} \tag{4.6}$$

Equation 4.7 shows the t-statistic which can be used to test the significance of the
phenotypic mean difference between two marker types in the DH population.

$$t = \frac{\hat{\mu}_{MM} - \hat{\mu}_{mm}}{\left[ s^2 \left( \frac{1}{n_{MM}} + \frac{1}{n_{mm}} \right) \right]^{1/2}} \sim t(df_{MM} + df_{mm}) \tag{4.7}$$


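When variances of the two marker types are heterogeneous, no exact t-test
is available. In this case, the Aspin-Welch approximate test can be
used. The approximate t-statistic and its approximate degrees of freedom are
given in equations 4.8 and 4.9, respectively.

$$t = \frac{\hat{\mu}_{MM} - \hat{\mu}_{mm}}{\left( \frac{S^2_{MM}}{n_{MM}} + \frac{S^2_{mm}}{n_{mm}} \right)^{1/2}} \sim t(df) \tag{4.8}$$

$$df = \frac{\left( \frac{S^2_{MM}}{n_{MM}} + \frac{S^2_{mm}}{n_{mm}} \right)^2}{\frac{1}{df_{MM}} \left( \frac{S^2_{MM}}{n_{MM}} \right)^2 + \frac{1}{df_{mm}} \left( \frac{S^2_{mm}}{n_{mm}} \right)^2} \tag{4.9}$$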

Genotypic data for 14 markers on chromosome 1H in the barley DH population
have been given in figure 1.7 in chapter 1. Average kernel weights of the 145 DH lines
are shown in table 4.3, where the DH lines are arranged in the same order as that in
figure 1.7. Take two markers, i.e., Act8A and Act8B, as examples to illustrate the
application of single marker analysis in the DH population.

Marker locus Act8A is located on chromosome 1H of barley. The two
sub-populations at Act8A show similar distributions of kernel weight (figure 4.2A),
indicating that there may be no obvious association between Act8A and kernel
weight, and thus Act8A may be either unlinked to or far away from genes affecting
kernel weight in the barley population. Act8B is located on chromosome 5H, and
the two sub-populations show quite different distributions of kernel weight (figure 4.2B),
indicating an association between Act8B and kernel weight. Therefore, Act8B may
be linked with genes affecting kernel weight.

Sample sizes are 70 and 74 for the two marker types 0 and 2, respectively, at locus Act8A;
one DH line has a missing marker type at this locus. Phenotypic means of
kernel weight are 42.23 mg and 42.79 mg for the two marker types (table 4.4).


TAB. 4.3 — Kernel weights (unit: mg) of the 145 DH lines derived from a cross between two
homozygous parents (averages across replications and environments are given).


DH line 1 2 3 4 5 6 7 8 9 10
1-10 41.01 40.41 40.11 40.28 41.50 45.75 40.20 44.10 42.09 45.66 
11-20 44.06 40.88 42.52 41.58 47.22 43.97 40.67 41.02 44.40 42.20 
21-30 44.07 45.04 39.76 41.32 41.13 48.46 47.20 42.44 44.19 43.38 
31-40 39.55 39.04 40.78 45.98 42.21 41.20 40.08 46.62 42.46 40.61 
41-50 43.02 46.08 43.30 43.51 40.93 41.57 42.39 46.10 44.15 42.36 
51-60 42.58 39.15 46.47 37.52 45.44 39.67 41.43 42.16 38.33 43.58 
61-70 46.62 42.35 42.17 43.11 39.17 44.69 43.67 42.32 43.06 42.76
71-80 38.76 42.69 41.18 39.39 42.54 41.10 41.84 43.88 43.32 43.85 
81-90 44.39 42.76 40.79 41.91 39.32 40.67 40.58 41.99 45.30 42.93 
91-100 41.86 39.91 44.39 46.45 41.81 43.21 45.46 41.37 44.35 39.49 
101-110 45.53 40.75 46.55 43.82 42.38 42.11 40.70 42.79 41.49 41.27 
111-120 39.55 44.84 43.16 41.28 42.49 46.13 41.22 42.79 42.51 43.01
121-130 43.18 42.55 41.85 36.45 40.91 44.99 43.72 37.69 42.85 42.67 
131-140 45.20 42.41 43.22 46.04 41.65 40.30 39.71 43.75 41.43 46.61 
141-145 42.11 40.63 43.47 39.14 43.75 


FIG. 4.2 — Frequency distributions of kernel weights for two marker types (type 0 and type 2) at locus Act8A
(A) and locus Act8B (B) in the barley DH population (see figure 1.7 in chapter 1). The horizontal axis is the
kernel weight in the middle of each group; the vertical axis is the frequency.


The t-statistic is equal to 1.51 and its significance probability is equal to 0.13, which
does not reach the significance level of 0.05. Therefore, marker Act8A can be
declared to be either unlinked to kernel weight genes or far away from
those genes, which confirms what has been observed from the
phenotypic distributions in figure 4.2A. Sample sizes are 58 and 69 for the two marker types 0 and
2, respectively, at locus Act8B; 18 DH lines have missing marker types at this locus.
Phenotypic means of kernel weight are 43.89 mg and 41.25 mg for the two marker
types (table 4.4). The t-statistic is equal to 8.37 and its significance probability is
much lower than the significance level of 0.01. Therefore, marker Act8B can be
declared to be linked with kernel weight genes, which also confirms what has
been observed from the phenotypic distributions in figure 4.2B.

TAB. 4.4 — Significance test for the difference in kernel weight (mg) between two types at two
marker loci in the barley DH population.


Parameter Marker locus Act8A Marker locus Act8B 

Type 0 Type 2 Type 0 Type 2 
Sample size 70 74 58 69 
Degree of freedom 69 73 57 68 
Sample mean 42.23 42.79 43.89 41.25 
Sample variance 4.45 5.32 3.53 2.79 
Standard deviation 2.11 2.31 1.88 1.67
Combined variance 4.90 3.13 
t-statistic 1.51 (P = 0.13) 8.37 (P = 1.00 x 10715) 
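
The tests in equations 4.6 to 4.9 need only the sample sizes, means, and variances of the two marker classes, so they can be recomputed from the summary statistics in table 4.4. The sketch below is one way to do this, assuming the rounded values printed in the table; `pooled_t` and `welch_t` are illustrative names, and small deviations from the reported t-statistics are expected because the inputs are rounded.

```python
import math


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Equal-variance t-test from summary statistics (equations 4.6 and 4.7)."""
    df1, df2 = n1 - 1, n2 - 1
    s2 = (df1 * var1 + df2 * var2) / (df1 + df2)           # combined variance
    t = (mean1 - mean2) / math.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
    return t, df1 + df2, s2


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Aspin-Welch approximate t-test (equations 4.8 and 4.9)."""
    v1, v2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


if __name__ == "__main__":
    # Summary statistics copied from table 4.4: (mean, variance, sample size)
    act8a = (42.23, 4.45, 70, 42.79, 5.32, 74)
    act8b = (43.89, 3.53, 58, 41.25, 2.79, 69)
    for name, stats in (("Act8A", act8a), ("Act8B", act8b)):
        t, df, s2 = pooled_t(*stats)
        t_w, df_w = welch_t(*stats)
        print(name, round(abs(t), 2), df, round(s2, 2), round(abs(t_w), 2), round(df_w, 1))
```

The sign of t depends only on which marker type is listed first; the absolute value is what is compared with the critical value.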


4.1.3 Single Marker Analysis by t-Test in Populations 
with Three Genotypes 


Take an F2 population and co-dominant markers as an example to illustrate the t-test method in
single marker analysis in populations with three genotypes. Three marker types are
represented by MM, Mm, and mm, and three QTL genotypes are represented by
QQ, Qq, and qq. By taking locus A as marker and locus B as QTL, theoretical
frequencies as given in table 2.5 in chapter 2 are also applicable to the nine
genotypes when one marker and one QTL are considered jointly. Under the one-locus
additive and dominant model, the phenotypic means of the nine genotypes at the two
loci are given in equation 4.10.

$$\begin{aligned}
\mu_{MMQQ} &= \mu_{MmQQ} = \mu_{mmQQ} = \mu + a, \\
\mu_{MMQq} &= \mu_{MmQq} = \mu_{mmQq} = \mu + d, \\
\mu_{MMqq} &= \mu_{Mmqq} = \mu_{mmqq} = \mu - a
\end{aligned} \tag{4.10}$$


The expected frequencies of the three marker types in the F2 population are 0.25,
0.5, and 0.25, respectively. Theoretical frequencies of the three QTL genotypes in each
marker type class can be acquired by dividing the nine genotypic frequencies in table 2.5
by the expected frequency of the marker type. The frequencies of the three
QTL genotypes are given in table 4.5, from which the three phenotypic means can be
calculated (last column in table 4.5). If the marker type of an F2 individual is known
to be MM, whether the individual has QTL genotype QQ, Qq, or qq cannot be clearly
determined. However, when all individuals with marker type MM are regarded as a
group, frequencies of the three QTL genotypes in this group
depend on the recombination frequency between marker and QTL, i.e., the frequency of
QQ is (1 - r)^2, the frequency of Qq is 2r(1 - r), and the frequency of qq is r^2 (table 4.5).
Therefore, the phenotypic mean of the MM group can be calculated by equation 4.11.

$$\mu_{MM} = (1 - r)^2 \mu_{QQ} + 2r(1 - r)\mu_{Qq} + r^2 \mu_{qq} = \mu + (1 - 2r)a + 2r(1 - r)d \tag{4.11}$$


TAB. 4.5 — Expected frequencies of QTL genotypes and phenotypic mean in each marker type group at one co-dominant locus in F2 population.


Marker  Frequency of QTL genotypes                                   Phenotypic mean
type    QQ (genotypic      Qq (genotypic      qq (genotypic
        value = μ + a)     value = μ + d)     value = μ - a)
MM      (1 - r)^2          2r(1 - r)          r^2                    μ_MM = μ + (1 - 2r)a + 2r(1 - r)d
Mm      r(1 - r)           1 - 2r + 2r^2      r(1 - r)               μ_Mm = μ + (1 - 2r + 2r^2)d
mm      r^2                2r(1 - r)          (1 - r)^2              μ_mm = μ - (1 - 2r)a + 2r(1 - r)d


Similar to equation 4.11, phenotypic means of the Mm and mm groups can be
obtained by equations 4.12 and 4.13, respectively.

$$\mu_{Mm} = \mu + (1 - 2r + 2r^2)d \tag{4.12}$$

$$\mu_{mm} = \mu - (1 - 2r)a + 2r(1 - r)d \tag{4.13}$$

The relationship between the three phenotypic means calculated in equations
4.11-4.13 can be shown as follows, i.e.,

$$\mu_{MM} - \mu_{mm} = 2(1 - 2r)a \tag{4.14}$$

$$\mu_{Mm} - \frac{1}{2}(\mu_{MM} + \mu_{mm}) = (1 - 2r)^2 d \tag{4.15}$$

Equation 4.14 only contains the additive effect of the QTL; equation 4.15 only
contains the dominant effect of the QTL. Therefore, if the marker locus is linked
with a QTL affecting the phenotypic trait, the difference between $\mu_{MM}$ and $\mu_{mm}$ can be used
to test the additive effect of the QTL, and the difference between $\mu_{Mm}$ and
$\frac{1}{2}(\mu_{MM} + \mu_{mm})$ can be used to test the dominant effect of the QTL.
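
The two contrasts in equations 4.14 and 4.15 follow directly from the conditional frequencies in table 4.5. A minimal sketch of that check is given below; the function names and the values chosen for the mean, a, d, and r are illustrative assumptions, not estimates from any data set.

```python
def f2_qtl_frequencies(r):
    """Frequencies of (QQ, Qq, qq) within each marker type of an F2 (table 4.5)."""
    return {
        "MM": ((1 - r) ** 2, 2 * r * (1 - r), r ** 2),
        "Mm": (r * (1 - r), 1 - 2 * r + 2 * r ** 2, r * (1 - r)),
        "mm": (r ** 2, 2 * r * (1 - r), (1 - r) ** 2),
    }


def f2_marker_means(mu, a, d, r):
    """Expected phenotypic mean of each marker type (equations 4.11 to 4.13)."""
    genotypic_values = (mu + a, mu + d, mu - a)   # values of QQ, Qq, qq
    return {
        marker: sum(f * g for f, g in zip(freqs, genotypic_values))
        for marker, freqs in f2_qtl_frequencies(r).items()
    }


if __name__ == "__main__":
    mu, a, d, r = 20.0, 2.0, 1.0, 0.1
    means = f2_marker_means(mu, a, d, r)
    additive_contrast = means["MM"] - means["mm"]
    dominance_contrast = means["Mm"] - 0.5 * (means["MM"] + means["mm"])
    print(additive_contrast, 2 * (1 - 2 * r) * a)      # both sides of equation 4.14
    print(dominance_contrast, (1 - 2 * r) ** 2 * d)    # both sides of equation 4.15
```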

Let $\hat{\mu}_{MM}$, $\hat{\mu}_{Mm}$, and $\hat{\mu}_{mm}$ represent the sample means of the three marker types, and
$S^2_{MM}$, $S^2_{Mm}$, and $S^2_{mm}$ represent the sample variances. Sample sizes of the three marker
types are represented by $n_{MM}$, $n_{Mm}$, and $n_{mm}$, and degrees of freedom of the sample
variances are represented by $df_{MM}$, $df_{Mm}$, and $df_{mm}$, respectively, where each degree of
freedom is equal to the corresponding sample size minus one. Assuming that the
different marker types have the same variance, the three sample variances can be
combined to estimate the common variance, as shown in equation 4.16.

$$s^2 = \frac{df_{MM} \times S^2_{MM} + df_{Mm} \times S^2_{Mm} + df_{mm} \times S^2_{mm}}{df_{MM} + df_{Mm} + df_{mm}} \tag{4.16}$$

The t-statistic for testing the significance of the phenotypic mean difference between
$\mu_{MM}$ and $\mu_{mm}$ is the same as that in equation 4.7, using the combined variance $s^2$ from
equation 4.16. The t-statistic for testing the significance of the difference between
$\mu_{Mm}$ and $\frac{1}{2}(\mu_{MM} + \mu_{mm})$ is given in equation 4.17.

$$t = \frac{\hat{\mu}_{Mm} - \frac{1}{2}(\hat{\mu}_{MM} + \hat{\mu}_{mm})}{\left[ s^2 \left( \frac{1}{4n_{MM}} + \frac{1}{n_{Mm}} + \frac{1}{4n_{mm}} \right) \right]^{1/2}} \sim t(df_{MM} + df_{Mm} + df_{mm}) \tag{4.17}$$

The F2 population with 110 individuals as given in figure 1.8 in chapter 1 is used
here as an example to demonstrate the calculation of the t-statistic in F2 populations. For
convenience, table 4.6 gives the phenotypic values of the 110 individuals in the same
order as that used in figure 1.8. Two co-dominant markers, i.e., M1-8 (see figure 1.8)
and M4-1 (not shown in figure 1.8), are located on two different chromosomes. For
locus M1-8, the three marker types show an obvious difference in their phenotypic
distributions (figure 4.3A), indicating that M1-8 may be linked with genes affecting the
phenotypic trait. For locus M4-1, the three marker genotypes show similar distributions

TAB. 4.6 — Phenotypic values of 110 F2 individuals derived from a cross between two
homozygous inbred parents (simulation data).


Individual 1 2 3 4 5 6 7 8 9 10 


1-10 18.18 18.06 18.36 16.29 19.68 20.71 16.98 21.75 22.06 18.74 
11-20 19.82 20.32 19.32 19.74 20.38 19.39 19.79 20.39 19.35 19.84 
21-30 20.27 19.40 21.39 21.45 17.56 20.60 20.72 19.92 18.31 19.95
31-40 16.13 20.29 20.99 17.72 18.88 22.41 20.38 23.23 20.75 20.16 
41-50 20.19 17.42 19.49 17.44 18.08 19.60 17.13 17.10 18.21 19.66 
51-60 21.04 21.21 18.92 20.59 20.06 23.20 20.97 20.36 20.68 21.18 
61-70 21.16 18.16 21.39 19.99 22.31 23.19 19.45 19.87 21.70 18.95 
71-80 19.29 18.00 20.40 21.34 20.09 19.21 19.62 18.52 17.63 19.78 
81-90 20.66 17.06 20.27 20.82 18.76 20.13 18.16 17.22 19.00 21.21 


91-100 19.49 17.46 19.34 21.63 20.49 18.52 16.50 19.37 22.21 17.77 
101-110 18.55 17.69 18.90 20.88 20.18 19.88 21.08 20.69 18.93 17.52 


FIG. 4.3 — Frequency distributions of phenotypic trait for three marker types (types B, H, and A) at locus M1-8
(A) and locus M4-1 (B) in an F2 population (see figure 1.8 in chapter 1). The horizontal axis is the
mid-group phenotypic value; the vertical axis is the frequency.


(figure 4.3B), indicating that M4-1 may be either unlinked to or far away from
genes affecting the phenotypic trait.

At locus M1-8, sample sizes of marker types mm, Mm, and MM (i.e., types B, H,
and A in table 4.7) are equal to 29, 50, and 28, respectively; three individuals have
missing types at the locus (figure 1.8 in chapter 1). Phenotypic means of the trait
are equal to 18.62, 19.76, and 20.43, respectively (table 4.7). The t-statistic to test
the additive effect is equal to 4.89, reaching the significance level of 0.001; the
t-statistic to test the dominant effect is equal to 0.87, which does not reach the significance level of
0.05. Therefore, locus M1-8 can be declared to be linked with a QTL affecting the
phenotypic trait. The linked QTL has a significant additive effect, but the dominant
effect is not significant. At locus M4-1, sample sizes of the three marker types are equal to
25, 52, and 32, respectively (in the column order of table 4.7); one individual has a missing type at the locus.
Phenotypic means of the trait are equal to 19.55, 19.83, and 19.44 (table 4.7). The
t-statistic to test the additive effect is equal to 0.27, and the t-statistic to test the
dominant effect is equal to 1.11; neither reaches the significance
level of 0.05. Therefore, locus M4-1 cannot be declared to be linked with any QTL
affecting the phenotypic trait.

TAB. 4.7 — Significance test of the difference in phenotypic means of the three marker types at
two loci in an F2 population (figure 1.8 in chapter 1).


Parameter Marker locus M1-8 Marker locus M4-1 

TypeB TypeH TypeA TypeB TypeH TypeA 
Sample size 29 50 28 25 52 32 
Sample mean 18.62 19.76 20.43 19.55 19.83 19.44 
Sample variance 1.69 2.21 1.82 2.43 2.07 2.96 
Standard deviation 1.30 1.49 1.35 1.56 1.44 1.72 
t-test for additive 4.89 (P = 5.49 x 1077) 0.27 (P = 0.79) 
t-test for dominance 0.87 (P = 0.54) 1.11 (P = 0.38) 
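
The additive and dominance t-statistics can be recomputed from the summary statistics of locus M1-8 in table 4.7. The sketch below assumes that both tests use the combined variance of equation 4.16, and that types A, H, and B correspond to MM, Mm, and mm as stated in the text; `f2_t_tests` is an illustrative name. Because the table values are rounded, the results agree with the reported 4.89 and 0.87 only approximately.

```python
import math


def f2_t_tests(stats):
    """Additive and dominance t-statistics from summary statistics.

    stats maps "MM", "Mm", and "mm" to (sample size, sample mean, sample variance).
    """
    df = sum(n - 1 for n, _, _ in stats.values())
    # Combined variance over the three marker types (equation 4.16)
    s2 = sum((n - 1) * var for n, _, var in stats.values()) / df
    n_MM, m_MM, _ = stats["MM"]
    n_Mm, m_Mm, _ = stats["Mm"]
    n_mm, m_mm, _ = stats["mm"]
    # Additive test: same form as equation 4.7
    t_add = (m_MM - m_mm) / math.sqrt(s2 * (1 / n_MM + 1 / n_mm))
    # Dominance test (equation 4.17)
    t_dom = (m_Mm - 0.5 * (m_MM + m_mm)) / math.sqrt(
        s2 * (1 / (4 * n_MM) + 1 / n_Mm + 1 / (4 * n_mm))
    )
    return t_add, t_dom, df


if __name__ == "__main__":
    # Locus M1-8 in table 4.7
    m1_8 = {"MM": (28, 20.43, 1.82), "Mm": (50, 19.76, 2.21), "mm": (29, 18.62, 1.69)}
    t_add, t_dom, df = f2_t_tests(m1_8)
    print(round(t_add, 2), round(t_dom, 2), df)
```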


4.1.4 ANOVA in Single Marker Analysis in Populations 
with Three Genotypes 


When there are multiple genotypes in the population and the sampling variances
can be assumed to be homogeneous, the F-statistic in the analysis of variance
(ANOVA) can be used to test whether the phenotypic means are equal or not.
Assume that in one F2 population with size n, sample sizes of the three marker types MM,
Mm, and mm are $n_1$, $n_2$, and $n_3$, respectively. The observed phenotypic value of
individual j in marker group i is denoted as $Y_{ij}$, that is,

$$Y_{ij} \sim N(\mu_i, \sigma^2), \quad \text{where } i = 1, 2, 3;\ j = 1, 2, \ldots, n_i$$

Let $\bar{Y}_{i\cdot} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$ and $\bar{Y}_{\cdot\cdot} = \frac{1}{n} \sum_{i=1}^{3} \sum_{j=1}^{n_i} Y_{ij}$. The total sum of squares $SS_T$
can be decomposed as follows,

$$SS_T = \sum_{i=1}^{3} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{3} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 + \sum_{i=1}^{3} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\cdot})^2 = SS_M + SS_e,$$

where $SS_M$ is the sum of squares of marker type effects with a degree of freedom
equal to 2 (i.e., one less than the number of genotypes), and $SS_e$ is the sum of squares of
error effects with a degree of freedom equal to n - 3. The total degree of freedom is
equal to n - 1, i.e., one less than the population size. $SS_M$ and $SS_e$ divided by their
degrees of freedom are called the mean square of marker effects and the mean square of
errors, denoted as $MS_M$ and $MS_e$, respectively. The null hypothesis in the test is
$H_0: \mu_1 = \mu_2 = \mu_3$. It can be proven in statistics that when the null hypothesis $H_0$ is
true,

$$\frac{SS_M}{\sigma^2} \sim \chi^2(2) \quad \text{and} \quad \frac{SS_e}{\sigma^2} \sim \chi^2(n - 3)$$

Therefore, an F-statistic can be defined in equation 4.18 and then used to test
the null hypothesis $H_0$.

$$F = \frac{MS_M}{MS_e} \sim F(2, n - 3) \tag{4.18}$$

For the data in table 4.7, the F-value at marker M1-8 is 14.61 (P = 2.55 × 10^-6),
reaching the significance level of 0.001, indicating again that the three marker classes
have significantly different phenotypic means, and that M1-8 is linked with genes
affecting the phenotypic trait. The F-value at marker M4-1 is 0.96 (P = 0.3844), which does not reach the
significance level of 0.05, indicating again that the phenotypic means of the three marker classes
are not significantly different, and that M4-1 cannot be declared to be linked with genes affecting
the phenotypic trait. The conclusions here from ANOVA are the same as
those previously obtained by t-test (table 4.7).
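
The decomposition of the sum of squares and the F-statistic in equation 4.18 can be sketched in a few lines. The code below is an illustration only: `one_way_anova` works on raw phenotypic values grouped by marker type, the small data set is hypothetical, and `f_pvalue_two_numerator_df` relies on the closed form of the upper tail of the F distribution that holds only when the numerator degree of freedom is 2, which is the case for three marker types.

```python
def one_way_anova(groups):
    """F-statistic for equality of group means (equation 4.18).

    groups is a list of lists, one list of phenotypic values per marker type.
    """
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_marker = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
    df_marker, df_error = len(groups) - 1, n - len(groups)
    f = (ss_marker / df_marker) / (ss_error / df_error)
    return f, df_marker, df_error


def f_pvalue_two_numerator_df(f, df_error):
    """Upper-tail probability of F(2, df_error); valid only for 2 numerator df."""
    return (1.0 + 2.0 * f / df_error) ** (-df_error / 2.0)


if __name__ == "__main__":
    toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # hypothetical values
    f, df_m, df_e = one_way_anova(toy)
    print(f, df_m, df_e, f_pvalue_two_numerator_df(f, df_e))
    # Probabilities for the F-values reported in the text (error df = n - 3)
    print(f_pvalue_two_numerator_df(14.61, 104))
    print(f_pvalue_two_numerator_df(0.96, 106))
```

Applied to the reported F-values of 14.61 (n = 107) and 0.96 (n = 109), the closed form returns probabilities close to those given in the text.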


4.1.5 Likelihood Ratio Test in Single Marker Analysis 


For two normally distributed populations, the t-test is often used and is also highly 
effective in testing the significance of the difference between two population means. 
When there are more than two populations and their variances are homogeneous, 
ANOVA and F-test can be used. For more complex situations when the variances are 
heterogeneous, a more general method, i.e., likelihood ratio test (LRT), has to be 
used. In fact, LRT has been used in testing the linkage relationship between two loci 
in chapter 2. 

Take again an F2 population and co-dominant markers as an example to illustrate the LRT
statistic in single marker analysis. Assume that the F2 population has size n, and
the phenotypic value of the ith individual is represented by random variable $Y_i$.
Sample sizes of the three marker types are $n_{MM}$, $n_{Mm}$, and $n_{mm}$. By re-arranging all
individuals by their marker types, individuals from 1 to $n_{MM}$ are assumed to be MM,
individuals from $n_{MM} + 1$ to $n_{MM} + n_{Mm}$ are assumed to be Mm, and individuals
from $n_{MM} + n_{Mm} + 1$ to $n_{MM} + n_{Mm} + n_{mm}$ (= n) are assumed to be mm. The null
hypothesis $H_0$ and its alternative hypothesis $H_A$ in the test are,

$$H_0: \mu_{MM} = \mu_{Mm} = \mu_{mm}$$

$$H_A: \text{at least two of the three means are not equal to each other}$$

Under the condition when $H_A$ is true, the samples are distributed as follows,

$$Y_i \sim N(\mu_{MM}, \sigma^2_{MM}), \quad i = 1, 2, \ldots, n_{MM};$$

$$Y_i \sim N(\mu_{Mm}, \sigma^2_{Mm}), \quad i = n_{MM} + 1, n_{MM} + 2, \ldots, n_{MM} + n_{Mm};$$

$$Y_i \sim N(\mu_{mm}, \sigma^2_{mm}), \quad i = n_{MM} + n_{Mm} + 1, n_{MM} + n_{Mm} + 2, \ldots, n_{MM} + n_{Mm} + n_{mm}\ (= n)$$

Therefore, the maximum likelihood estimates of mean and variance for the three
marker types are,

$$\hat{\mu}_{MM} = \frac{1}{n_{MM}} \sum_{i=1}^{n_{MM}} Y_i, \quad \hat{\sigma}^2_{MM} = \frac{1}{n_{MM}} \sum_{i=1}^{n_{MM}} (Y_i - \hat{\mu}_{MM})^2;$$

$$\hat{\mu}_{Mm} = \frac{1}{n_{Mm}} \sum_{i=n_{MM}+1}^{n_{MM}+n_{Mm}} Y_i, \quad \hat{\sigma}^2_{Mm} = \frac{1}{n_{Mm}} \sum_{i=n_{MM}+1}^{n_{MM}+n_{Mm}} (Y_i - \hat{\mu}_{Mm})^2;$$

$$\hat{\mu}_{mm} = \frac{1}{n_{mm}} \sum_{i=n_{MM}+n_{Mm}+1}^{n} Y_i, \quad \hat{\sigma}^2_{mm} = \frac{1}{n_{mm}} \sum_{i=n_{MM}+n_{Mm}+1}^{n} (Y_i - \hat{\mu}_{mm})^2$$

The probability density function of the normal distribution $N(\mu, \sigma^2)$ is denoted
by

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y - \mu)^2}{2\sigma^2} \right]$$

If the observed value of the random variable $Y_i$ is represented by $y_i$, the maximum
value of the likelihood function under $H_A$ is given by equation 4.19.

$$\max L(H_A) = \prod_{i=1}^{n_{MM}} f(y_i; \hat{\mu}_{MM}, \hat{\sigma}^2_{MM}) \prod_{i=n_{MM}+1}^{n_{MM}+n_{Mm}} f(y_i; \hat{\mu}_{Mm}, \hat{\sigma}^2_{Mm}) \prod_{i=n_{MM}+n_{Mm}+1}^{n} f(y_i; \hat{\mu}_{mm}, \hat{\sigma}^2_{mm}) \tag{4.19}$$

Under the null hypothesis $H_0$, all samples follow the same normal distribution, i.e.,

$$Y_i \sim N(\mu_0, \sigma^2_0), \quad i = 1, 2, \ldots, n$$

Maximum likelihood estimates of mean and variance are,

$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} Y_i \quad \text{and} \quad \hat{\sigma}^2_0 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{\mu}_0)^2$$

The maximum value of the likelihood function under $H_0$ is therefore,

$$\max L(H_0) = \prod_{i=1}^{n} f(y_i; \hat{\mu}_0, \hat{\sigma}^2_0) \tag{4.20}$$

Thus, the LRT statistic can be defined by equation 4.21.

$$LRT = -2 \ln \frac{\max L(H_0)}{\max L(H_A)} \sim \chi^2(df) \tag{4.21}$$

$H_A$ has six independent parameters and $H_0$ has two independent
parameters to be estimated from the samples. In the case of large sample size, the
LRT statistic as defined by equation 4.21 approximately follows a chi-square
distribution, and the degree of freedom of the chi-square distribution is equal to
the difference between the number of independent parameters in $H_A$ and the number
of independent parameters in $H_0$. Therefore, the degree of freedom is equal to 4 for
the chi-square distribution in equation 4.21. In another test, if the three marker types
can be assumed to have equal variance, estimates of the three phenotypic means remain
unchanged, but the estimate of the common variance is different. In fact, an estimate
of the common variance can be calculated by equation 4.16. In this situation, $H_A$ has
four independent parameters, $H_0$ has two independent parameters, and the degree of
freedom of the LRT statistic is equal to two.

Similar to the significance test on linkage relationship in chapter 2, the LOD
score as given in equation 4.22 can also be used to test the difference in phenotypic
means. If the ratio of the two maximum likelihoods is 10, LOD = 1; if the ratio is 100,
LOD = 2. Therefore, the LOD score reflects the ratio of the two maximum likelihoods
more intuitively. From their definitions, the relationship between the LOD score and the
LRT statistic can be acquired and is given in equation 4.23.

$$LOD = \log_{10} \frac{\max L(H_A)}{\max L(H_0)} \tag{4.22}$$

$$LOD = \frac{LRT}{2 \ln 10} \approx \frac{LRT}{4.605} \tag{4.23}$$


For the phenotypic data given in table 4.7, when the three marker types are assumed
to have equal variance, the LOD score is 5.77 for M1-8 and the corresponding LRT
statistic is equal to 26.55 (df = 2, P = 1.72 × 10^-6), indicating a highly significant
linkage relationship between M1-8 and genes affecting the phenotypic trait. The LOD score is
0.44 for M4-1 and the corresponding LRT statistic is equal to 2.04 (df = 2,
P = 0.3598), indicating again a non-significant linkage relationship between M4-1
and genes affecting the phenotypic trait.
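
Under the equal-variance assumption, the maximized likelihoods in equations 4.19 and 4.20 depend on the data only through the error and total sums of squares, so that LRT = n ln(SS_T / SS_e) when maximum likelihood variance estimates (divisor n) are used. The sketch below illustrates this together with the conversions to a LOD score and to a chi-square probability. The function names and the toy data are assumptions made for illustration, and the closed form exp(-x/2) for the upper tail of the chi-square distribution holds only for two degrees of freedom.

```python
import math


def lrt_equal_variance(groups):
    """LRT for equal means across marker types, assuming one common variance."""
    values = [y for g in groups for y in g]
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((y - grand_mean) ** 2 for y in values)          # n * variance under H0
    ss_error = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)  # n * variance under HA
    return n * math.log(ss_total / ss_error)


def lod_from_lrt(lrt):
    """Convert an LRT statistic to a LOD score (equation 4.23)."""
    return lrt / (2.0 * math.log(10.0))


def chi2_pvalue_two_df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


if __name__ == "__main__":
    toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # hypothetical values
    print(lrt_equal_variance(toy))
    # LRT statistics reported in the text for M1-8 and M4-1
    for lrt in (26.55, 2.04):
        print(round(lod_from_lrt(lrt), 2), chi2_pvalue_two_df(lrt))
```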


4.1.6 Problems with Single Marker Analysis 


As has been seen from equations 4.4, 4.14, and 4.15, the difference between phe- 
notypic means of marker types is affected by the genetic effects of linked QTL, and 
genetic distance between the marker and the QTL. The observed difference may be 
caused by one QTL with larger effects but weaker linkage, or one QTL with smaller 
effects but stronger linkage. In other words, single marker analysis cannot separate 
the QTL genetic effects from the linkage distance. In the meantime, the observed 
difference may also be caused by the joint action of two or more linked QTLs. In this 
case, it becomes more difficult to distinguish the effects of QTL and the distance 
between QTL and the marker. Therefore, single marker analysis can make the 
correct estimation of QTL effects only when the marker and QTL are completely 
linked. In addition, there is no background control in single marker analysis. In 
sub-populations composed of individuals with different marker types, genetic vari- 
ance caused by unlinked QTL (also called background genetic variance) is also 
included in their phenotypic variances, in addition to random errors. If included, 
background genetic variation will increase the sampling variances and therefore 
reduce the testing power of single marker analysis. 
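
The confounding of QTL effect and linkage distance can be made concrete with equation 4.4. In the sketch below, two hypothetical QTLs in a DH population, one with a large effect and loose linkage and one with a small effect and complete linkage, produce exactly the same expected difference between the marker classes; the function name and the parameter values are illustrative only.

```python
def expected_marker_difference(a, r):
    """Expected difference between the MM and mm means in a DH population (equation 4.4)."""
    return 2.0 * (1.0 - 2.0 * r) * a


if __name__ == "__main__":
    scenarios = {
        "large effect, loose linkage": (1.0, 0.25),   # (a, r)
        "small effect, complete linkage": (0.5, 0.0),
    }
    for label, (a, r) in scenarios.items():
        # Half of the difference equals (1 - 2r)a, which matches a only when r = 0.
        difference = expected_marker_difference(a, r)
        print(label, difference, difference / 2.0)
```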


4.2 Simple Interval Mapping


Single marker analysis can estimate the genetic effects of QTL correctly only when 
the QTL is completely linked to the marker. If crossovers happened between the 
marker and QTL, the method cannot separate the linkage distance from genetic 
effects. Lander and Botstein (1989) proposed an interval-based mapping method 
that simultaneously considers two adjacent markers on the chromosome to locate 
QTLs and estimate their genetic effects. Interval mapping conducts the 
one-dimensional scanning step by step on each chromosome in the genome repre- 
sented by the constructed linkage map. When determining whether there is a QTL 
in a specific chromosomal position, two adjacent markers on the left side and right 
side of the current position are considered jointly. The possibility of one QTL being 
located at the scanning position is indicated by the LOD score. The profile of the 
LOD score can therefore be obtained for each chromosome, and peaks in the profile 
exceeding a pre-specified threshold value are regarded as QTLs. 


4.2.1 Frequencies of the QTL Genotypes in a Marker 
Interval 


Assume two homozygous parents show polymorphism at two linked loci (called A
and B, or left and right markers), two alleles at locus A are represented by A and
a (i.e., downward triangles in figure 4.4), and two alleles at locus B are represented
by B and b (i.e., upward triangles in figure 4.4). Two alleles of the QTL (called locus
Q) located between the two markers are represented by Q and q (i.e., circles in
figure 4.4). Recombination frequency between the left marker and QTL is denoted
by $r_L$, between QTL and the right marker is denoted by $r_R$, and between the two
markers is denoted by r. Under the assumption that crossovers happen
independently, the relationship of the three recombination frequencies can be expressed
by equation 4.24, which is actually the same as equation 3.2 in chapter 3.

$$r = r_L + r_R - 2 r_L r_R \tag{4.24}$$


As no additional markers can be used between locus A and locus B, gametes
produced by no crossover and by a double crossover in the marker interval
are indistinguishable, because they have the same haploid type as far as
loci A and B are concerned. In fact, all even-numbered crossovers produce
gametes having the parental haploid type; all odd-numbered crossovers produce
gametes having the recombinant haploid type. However, the probability for two
or more crossovers to happen in a short chromosome interval is very low and
therefore not considered in the estimation of recombination frequency.

FIG. 4.4 — Schematic representation showing the crossover events in the chromosomal
interval with one QTL located between two markers in one DH population. Notes:
Downward triangles represent the position of the left marker on the chromosome, upward
triangles represent the position of the right marker, and circles represent the position of QTL.
The solid shape represents the allele from one parent (i.e., P1), and the un-filled shape
represents the allele of the other parent (i.e., P2). Recombination frequency between the left marker
and QTL is represented by $r_L$, and between QTL and the right marker is represented by $r_R$.

Consider jointly two marker loci and one QTL which is located between the two
markers, and ignore two or more crossovers between the left
marker and QTL, and between QTL and the right marker. The F1 hybrid will
generate gametes with eight haploid types, which can be classified into four
categories based on the number of crossovers and the interval where the crossover
happened (figure 4.4). (1) No crossover has happened. The probability is equal to
$(1 - r_L)(1 - r_R)$, and the two types of gametes thus produced have equal frequency,
i.e., $\frac{1}{2}(1 - r_L)(1 - r_R)$. (2) One crossover happened between the left marker and
QTL. The probability is equal to $r_L(1 - r_R)$, and the two types of gametes thus
produced have equal frequency, i.e., $\frac{1}{2} r_L(1 - r_R)$. (3) One crossover happened
between QTL and the right marker. The probability is equal to $(1 - r_L) r_R$, and the
two types of gametes thus produced have equal frequency, i.e., $\frac{1}{2}(1 - r_L) r_R$. (4) Two
crossovers have happened, one between the left marker and QTL, and the other one
between QTL and the right marker. The probability is equal to $r_L r_R$, and the two
types of gametes thus produced have equal frequency, i.e., $\frac{1}{2} r_L r_R$.

Gametes generated by the F1 hybrid are doubled to produce the DH population
(figure 4.4). As QTL genotypes cannot be observed, the eight genotypes in the DH
population can be classified into four groups by marker types. The joint frequencies
of QTL genotypes in each marker group are given in table 4.8, and the sum of the two
frequencies of QTL genotypes is equal to the frequency of the corresponding marker
type in the DH population. For example, for marker type AABB, the sum of
genotypic frequencies of QTL is,

$$\frac{1}{2}(1 - r_L)(1 - r_R) + \frac{1}{2} r_L r_R = \frac{1}{2}(1 - r_L - r_R + 2 r_L r_R) = \frac{1}{2}(1 - r)$$

Therefore, the conditional frequencies of the two QTL genotypes in
each marker group can be acquired, i.e., being equal to the joint frequencies divided
by the frequency of the corresponding marker type.

TAB. 4.8 — Expected frequencies of four marker types and the joint frequencies of two QTL
genotypes in the DH population.


Left    Right   Sample  Frequency of   Joint frequency of QTL genotype
marker  marker  size    marker type    QQ                      qq
AA      BB      n1      (1 - r)/2      (1 - r_L)(1 - r_R)/2    r_L r_R/2
AA      bb      n2      r/2            (1 - r_L) r_R/2         r_L (1 - r_R)/2
aa      BB      n3      r/2            r_L (1 - r_R)/2         (1 - r_L) r_R/2
aa      bb      n4      (1 - r)/2      r_L r_R/2               (1 - r_L)(1 - r_R)/2


Gametes generated by the F1 hybrid are randomly combined to have the F2
population with 27 genotypes when the three loci are considered jointly. As QTL
genotypes cannot be observed, the 27 genotypes in the F2 population can be classified
into nine groups by marker types. The joint frequencies of three QTL genotypes
in each marker group are given in table 4.9, and the sum of the three
frequencies of QTL genotypes is equal to the frequency of the corresponding marker
type in the F2 population. For example, for marker type AABB, the sum of genotypic
frequencies of QTL is,


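$$\frac{1}{4}(1 - r_L)^2(1 - r_R)^2 + \frac{1}{2}(1 - r_L)(1 - r_R) r_L r_R + \frac{1}{4} r_L^2 r_R^2$$
$$= \frac{1}{4}\left[(1 - r_L)(1 - r_R) + r_L r_R\right]^2$$
$$= \frac{1}{4}\left[1 - (r_L + r_R - 2 r_L r_R)\right]^2 = \frac{1}{4}(1 - r)^2$$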


Therefore, conditional frequencies of the three QTL genotypes can be acquired, 
i.e., being equal to the joint frequencies divided by the frequency of the corre- 
sponding marker type. 

Assume that markers A and B are located at 10 cM and 30 cM on the linkage 
map of one chromosome, and the current scanning position is at 16 cM. Using the 
Haldane mapping function, the recombination frequencies between markers A and 
B, marker A and locus Q, and locus Q and marker B are given below, respectively. 


TAB. 4.9 — Expected frequencies of nine marker types and the joint frequencies of three QTL genotypes in the F2 population.


$$r = \frac{1}{2}\left[1 - e^{-(30-10)/50}\right] = 0.1648,$$
$$r_L = \frac{1}{2}\left[1 - e^{-(16-10)/50}\right] = 0.0565,$$
$$r_R = \frac{1}{2}\left[1 - e^{-(30-16)/50}\right] = 0.1221$$


Based on table 4.8, frequencies of the four marker types in the DH population,
and both joint and conditional frequencies of QTL genotypes can be calculated, which
are given in table 4.10. The frequency of a marker type depends only on the recombination
frequency between the two markers. However, the frequency of the QTL genotype
also depends on the relative position of QTL in the interval. Marker types AABB and
aabb represent the two parental genotypes. In the group of marker type AABB, QQ is
the major QTL genotype; qq comes from the gametes generated by double-crossovers
and therefore has a low frequency (table 4.10). In contrast, qq is the major QTL
genotype in the group of marker type aabb (table 4.10). Marker types AAbb and aaBB
represent the two recombinant types. The linkage distance of the QTL is 6 cM to the
left marker, and 14 cM to the right marker. A single crossover between the two markers
is therefore more likely to occur in interval Q-B than in interval A-Q. As a result,
gamete type AQb has a higher frequency than Aqb, and gamete type aqB has a higher
frequency than aQB. Genotype QQ also has a higher frequency than qq in marker
group AAbb; genotype qq also has a higher frequency than QQ in marker group aaBB
(table 4.10).


TAB. 4.10 — Frequencies of four marker types and two QTL genotypes in the DH population.


Marker type Frequency Joint frequency Conditional frequency 
QQ qq QQ qq 
AABB 0.417580 0.414128 0.003452 0.991733 0.008267 
AAbb 0.082420 0.057602 0.024818 0.698885 0.301115 
aaBB 0.082420 0.024818 0.057602 0.301115 0.698885 
aabb 0.417580 0.003452 0.414128 0.008267 0.991733 


Note: Marker A, locus Q, and marker B are located at 10 cM, 16 cM, and 30 cM on the 
linkage map of one chromosome, respectively; the conditional frequency of the QTL genotype 
is equal to the joint frequency divided by the frequency of marker type. 
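
The entries of table 4.10 can be checked numerically. The following is a minimal Python sketch,
assuming the Haldane mapping function and the joint frequencies of table 4.8; the function names
`haldane` and `dh_frequencies` are illustrative and do not come from any mapping software.

```python
import math


def haldane(d_cm):
    """Haldane mapping function: map distance in cM -> recombination frequency."""
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))


def dh_frequencies(pos_a, pos_q, pos_b):
    """Marker-type, joint, and conditional QTL frequencies in a DH population.

    Positions are in cM with pos_a < pos_q < pos_b (an assumption of this sketch).
    Returns {marker_type: (frequency, (joint_QQ, joint_qq), (cond_QQ, cond_qq))}.
    """
    r_l = haldane(pos_q - pos_a)   # left marker to QTL
    r_r = haldane(pos_b - pos_q)   # QTL to right marker
    joint = {
        "AABB": (0.5 * (1 - r_l) * (1 - r_r), 0.5 * r_l * r_r),
        "AAbb": (0.5 * (1 - r_l) * r_r, 0.5 * r_l * (1 - r_r)),
        "aaBB": (0.5 * r_l * (1 - r_r), 0.5 * (1 - r_l) * r_r),
        "aabb": (0.5 * r_l * r_r, 0.5 * (1 - r_l) * (1 - r_r)),
    }
    table = {}
    for marker, (f_qq_upper, f_qq_lower) in joint.items():
        total = f_qq_upper + f_qq_lower          # frequency of the marker type
        table[marker] = (total,
                         (f_qq_upper, f_qq_lower),
                         (f_qq_upper / total, f_qq_lower / total))
    return table


table = dh_frequencies(10.0, 16.0, 30.0)
for marker, (freq, joint, cond) in table.items():
    print(marker, round(freq, 6), [round(x, 6) for x in joint], [round(x, 6) for x in cond])
```

Changing the middle argument moves the scanning position inside the interval; the marker-type
frequencies stay the same, while the joint and conditional QTL frequencies change.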


Assume that the phenotype of QQ follows normal distribution N(3, 1), and the
phenotype of qq follows normal distribution N(5, 1). Therefore, the additive effect of
the QTL is equal in magnitude to one standard deviation. In the DH population, it has
been seen from table 4.10 that each marker type group is a mixture distribution composed of
two QTL genotypes, occurring at specific frequencies. Figure 4.5 shows the distributions
of four marker types together with two QTL genotypes as components
included in each marker class at the scanning position. It can be seen that differences
in the QTL distributions and their frequencies lead to differences in the
phenotypic distributions of the four marker groups. Conversely, the distributions
of the two QTL genotypes can be inferred from the distributions of the four marker
groups, which is the underlying principle of the interval-based method in QTL
mapping in DH mapping populations.

FIG. 4.5 — Distributions of four marker types together with the two QTL genotypes as
components included in each marker class at one scanning position in DH population. Notes:
At the current scanning position, locus A (i.e., the left marker), locus Q, and locus B (i.e., the
right marker) are located at 10 cM, 16 cM, and 30 cM on the chromosome, respectively. QQ
has a normal distribution N(3, 1), and qq has a normal distribution N(5, 1); therefore the
additive effect is equal in magnitude to one standard deviation.
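
How the component frequencies shape each marker group can be seen from the first two moments of
the mixture. The sketch below is only an illustration: it takes the conditional frequencies from
table 4.10 and the assumed distributions N(3, 1) and N(5, 1), and the helper name
`mixture_mean_var` is hypothetical.

```python
def mixture_mean_var(pis, means, var):
    """Mean and variance of a normal mixture with a common component variance."""
    mean = sum(p * m for p, m in zip(pis, means))
    # Total variance = within-component variance + variance of the component means.
    variance = var + sum(p * (m - mean) ** 2 for p, m in zip(pis, means))
    return mean, variance


# Conditional frequencies of (QQ, qq) in each marker group, copied from table 4.10.
conditional = {
    "AABB": (0.991733, 0.008267),
    "AAbb": (0.698885, 0.301115),
    "aaBB": (0.301115, 0.698885),
    "aabb": (0.008267, 0.991733),
}

for marker, pis in conditional.items():
    mean, variance = mixture_mean_var(pis, (3.0, 5.0), 1.0)
    print(marker, round(mean, 4), round(variance, 4))
```

The two parental marker groups have means close to 3 and 5 with variances close to 1, whereas
the two recombinant groups have intermediate means and inflated variances.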

At the same scanning position, table 4.11 shows the joint frequencies of three 
QTL genotypes and their conditional frequencies in each of the nine marker groups 


TAB. 4.11 — Frequencies of nine marker types and three QTL genotypes in the F2 population.


Marker   Frequency   Joint frequency                  Conditional frequency
type                 QQ         Qq         qq         QQ         Qq         qq
AABB     0.174373    0.171502   0.002859   0.000012   0.983535   0.016397   0.000068
AABb     0.068834    0.047709   0.020953   0.000171   0.693107   0.304403   0.002489
AAbb     0.006793    0.003318   0.002859   0.000616   0.488440   0.420890   0.090670
AaBB     0.068834    0.020556   0.047881   0.000398   0.298626   0.695597   0.005777
AaBb     0.362332    0.005718   0.350896   0.005718   0.015782   0.968436   0.015782
Aabb     0.068834    0.000398   0.047881   0.020556   0.005777   0.695597   0.298626
aaBB     0.006793    0.000616   0.002859   0.003318   0.090670   0.420890   0.488440
aaBb     0.068834    0.000171   0.020953   0.047709   0.002489   0.304403   0.693107
aabb     0.174373    0.000012   0.002859   0.171502   0.000068   0.016397   0.983535


Note: Marker A, locus Q, and marker B are located at 10 cM, 16 cM, and 30 cM on the
linkage map of one chromosome, respectively; the conditional frequency of the QTL genotype
is equal to the joint frequency divided by the frequency of marker type.
in the F2 population. Taking marker group AABB as an example, genotype
AAQQBB is produced by the combination of two non-crossover gametes with the
haploid type AQB. Type AQB occurs at a higher frequency, i.e., $\frac{1}{2}(1 - r_L)(1 - r_R)$,
in the gamete population, and QQ has the largest frequency in marker group AABB.
Genotype AAQqBB is produced by the combination of one non-crossover gamete
with haploid type AQB and one double-crossover gamete with haploid type AqB.
The frequency of the double-crossover gamete AqB is much lower, i.e., $\frac{1}{2} r_L r_R$, in the
gamete population. Therefore, the frequency of Qq in marker group AABB is lower
than that of QQ. Genotype AAqqBB is produced by the combination of two
double-crossover gametes with haploid type AqB. Therefore, the frequency of qq is
extremely low in marker group AABB.
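
The F2 frequencies can be reproduced by enumerating all pairs of F1 gametes. The following is a
hedged sketch rather than code from any mapping package: it assumes the F1 hybrid is AQB/aqb,
no crossover interference, and random union of gametes, and all function names are illustrative.

```python
import math
from collections import defaultdict
from itertools import product


def haldane(d_cm):
    """Haldane mapping function: map distance in cM -> recombination frequency."""
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))


def gamete_frequencies(r_l, r_r):
    """Frequencies of the eight gametes produced by an AQB/aqb F1 hybrid."""
    freq = {}
    for a, q, b in product("Aa", "Qq", "Bb"):
        cross_left = (a == "A") != (q == "Q")    # crossover between A and Q
        cross_right = (q == "Q") != (b == "B")   # crossover between Q and B
        p_left = r_l if cross_left else 1.0 - r_l
        p_right = r_r if cross_right else 1.0 - r_r
        freq[a + q + b] = 0.5 * p_left * p_right
    return freq


def locus_genotype(x, y):
    """Unordered genotype at one locus, upper-case allele first (e.g. 'Qq')."""
    return "".join(sorted(x + y))


def f2_joint_frequencies(r_l, r_r):
    """joint[marker_type][qtl_genotype] under random union of F1 gametes."""
    gametes = gamete_frequencies(r_l, r_r)
    joint = defaultdict(lambda: defaultdict(float))
    for (g1, p1), (g2, p2) in product(gametes.items(), repeat=2):
        marker = locus_genotype(g1[0], g2[0]) + locus_genotype(g1[2], g2[2])
        joint[marker][locus_genotype(g1[1], g2[1])] += p1 * p2
    return joint


joint = f2_joint_frequencies(haldane(16 - 10), haldane(30 - 16))
for marker in ("AABB", "AAbb", "AaBb"):
    total = sum(joint[marker].values())
    cond = {g: round(p / total, 6) for g, p in joint[marker].items()}
    print(marker, round(total, 6), cond)
```

Dividing each joint frequency by the marker-type total gives the conditional frequencies, as in
the note to table 4.11.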

Assume that the phenotype of QQ follows normal distribution N(3, 1), the phenotype
of Qq follows normal distribution N(5, 1), and the phenotype of qq follows
normal distribution N(5, 1). Therefore, the additive effect of the QTL is equal in
magnitude to one standard deviation, and the dominant effect has the same magnitude as
the additive effect. In the F2 population, it has been seen from table 4.11 that each
marker type group is a mixture distribution composed of the three QTL genotypes,
occurring at specific frequencies. Figure 4.6 shows the distributions of nine marker
types together with three QTL genotypes as components included in each marker class
at the scanning position. It can be seen that differences in the QTL distributions and
their frequencies lead to differences in the phenotypic distributions of the nine marker
groups. Conversely, the distributions of the three QTL genotypes can be inferred from the
distributions of the nine marker groups, which is the underlying principle of the
interval-based method in QTL mapping in F2 mapping populations.

FIG. 4.6 — Distributions of nine marker types together with the three QTL genotypes as
components included in each marker class at one scanning position in F2 population. Notes: At
the current scanning position, locus A (i.e., the left marker), locus Q, and locus B (i.e., the
right marker) are located at 10 cM, 16 cM, and 30 cM on the chromosome, respectively. QQ
has a normal distribution N(3, 1), Qq has a normal distribution N(5, 1), and qq has a normal
distribution N(5, 1); therefore both additive and dominant effects are equal in magnitude to
one standard deviation.


4.2.2 Maximum Likelihood Estimation of Phenotypic 
Means of QTL Genotypes 


In addition to maximum likelihood, the most commonly used statistical method in
parameter estimation, other methods based on regression analysis have also been
proposed in QTL mapping (Whittaker et al., 1996; Haley et al., 1994; Wright and
Mowers, 1994; Haley and Knott, 1992; Knott and Haley, 1992; Martinez and
Curnow, 1992). This book is focused on the maximum likelihood method. Assuming
that the individuals or families included in a mapping population can be classified
into m groups by their identifiable marker types, the observed phenotype on a
quantitative trait in each marker group is represented by $y_{ij}$, where $i = 1, \ldots, m$,
$j = 1, \ldots, n_i$, and $n_i$ is the sample size of the ith marker group. The QTL has
q genotypes in the population, each following a normal distribution with mean
$\mu_k$ ($k = 1, \ldots, q$) and common variance $\sigma^2$. The conditional frequency of the kth QTL
genotype in the ith marker group is denoted by $\pi_{ik}$, and phenotype $Y_{ij}$ in each
marker group follows a mixture distribution (McLachlan, 1988) with the
q QTL genotypes as components, which is shown in equation 4.25.


$$Y_{ij} \sim \sum_{k=1}^{q} \pi_{ik} N(\mu_k, \sigma^2), \quad i = 1, \ldots, m \text{ and } j = 1, \ldots, n_i,$$ (4.25)


where $n_i$ is the sample size of the ith marker group. The probability density function
of the normal distribution $N(\mu, \sigma^2)$ is represented by $f(y \mid \mu, \sigma^2)$, which is given in
equation 4.26.


$$f(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right]$$ (4.26)


Let $Y = (Y_{ij})$ represent the vector composed of random variables on phenotypes,
and $y = (y_{ij})$ represent the vector composed of phenotypic observations. The likelihood
function of the samples and the logarithm of the likelihood are given below,
respectively.


$$L(\mu_1, \ldots, \mu_q, \sigma^2 \mid Y = y) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} \sum_{k=1}^{q} \pi_{ik} f(y_{ij} \mid \mu_k, \sigma^2), \text{ and}$$
$$\ln L(\mu_1, \ldots, \mu_q, \sigma^2 \mid Y = y) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \ln\left[\sum_{k=1}^{q} \pi_{ik} f(y_{ij} \mid \mu_k, \sigma^2)\right]$$ (4.27)

Maximum likelihood estimates cannot be obtained in closed form from
equation 4.27, which is constructed from the mixture distribution of observed
samples, i.e., equation 4.25. A two-step iterative algorithm, which is called
Expectation and Maximization (EM), has to be used here. The algorithm has already
been used in the estimation of recombination frequency in chapter 2. The EM
algorithm is an efficient method for analyzing incomplete data (Dempster et al.,
1977). As indicated by the name, the algorithm consists of the expectation and
maximization steps. Given the initial values of the solutions, the first step is to calculate
the conditional (or posterior) probabilities of the observed sample and convert the
incomplete sampling data into complete data. The second step is to rebuild
the sampling likelihood function based on the converted data and re-calculate the
solutions. The newly calculated solutions are used as initial values for the next
iteration of the two steps. In QTL mapping, the unknown QTL genotype in each
sample is the incomplete data. Due to its importance in QTL mapping, a detailed
description of the algorithm in QTL mapping is given below.

At the beginning of the EM algorithm, a set of initial values has to be specified
for the parameters to be estimated, i.e., $\mu_k$ ($k = 1, \ldots, q$) and $\sigma^2$, which are represented
by $\mu_k^{(0)}$ ($k = 1, \ldots, q$) and $\sigma^{2(0)}$, respectively. For example, in the DH population,
most DH lines with marker type AABB have the QQ genotype; most DH
lines with marker type aabb have the qq genotype (table 4.10). Therefore, the sample
mean of the AABB lines can be set as the initial value of $\mu_1$; the sample mean of the
aabb lines can be set as the initial value of $\mu_2$; the sample variance of the DH
population can be set as the initial value of $\sigma^2$. In the F2 population, most individuals
with marker type AABB have the QQ genotype; most individuals with
marker type AaBb have the Qq genotype; most individuals with the marker type
aabb have the qq genotype (table 4.11). Therefore, the sample mean of the AABB
individuals can be set as the initial value of $\mu_1$; the sample mean of the AaBb
individuals can be set as the initial value of $\mu_2$; the sample mean of the aabb individuals
can be set as the initial value of $\mu_3$; the sample variance of the F2 population
can be set as the initial value of $\sigma^2$.

E-step (expectation step): Given the initial values of parameters to be estimated,
the expected probability that the sample $Y_{ij} = y_{ij}$ belongs to each QTL genotype
can be calculated, i.e., $w_{ijk}$ in equation 4.28. The conditional probability given by
equation 4.28 is also called the posterior probability. That is to say, when the
components included in the mixture distribution (equation 4.25) are known, the
probabilities that each sample comes from the different component distributions are known
as well.


$$w_{ijk}^{(0)} = \frac{\pi_{ik} f\left(y_{ij} \mid \mu_k^{(0)}, \sigma^{2(0)}\right)}{\sum_{l=1}^{q} \pi_{il} f\left(y_{ij} \mid \mu_l^{(0)}, \sigma^{2(0)}\right)}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n_i;\ k = 1, \ldots, q$$ (4.28)


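M-step (maximization step): The sample $y_{ij}$ is split into different QTL genotypes
by its posterior probabilities given by equation 4.28. The likelihood function and the
logarithm likelihood are re-established as equation 4.29.

$$L(\mu_1, \ldots, \mu_q, \sigma^2 \mid Y = y) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} \prod_{k=1}^{q} \left[\pi_{ik} f(y_{ij} \mid \mu_k, \sigma^2)\right]^{w_{ijk}^{(0)}}, \text{ and}$$
$$\ln L(\mu_1, \ldots, \mu_q, \sigma^2 \mid Y = y) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \sum_{k=1}^{q} w_{ijk}^{(0)} \ln\left[\pi_{ik} f(y_{ij} \mid \mu_k, \sigma^2)\right]$$ (4.29)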


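The values that maximize equation 4.29 can be worked out directly by setting the
partial derivatives equal to 0, and are given in equation 4.30.


$$\mu_k^{(1)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} w_{ijk}^{(0)} y_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n_i} w_{ijk}^{(0)}}, \quad \sigma^{2(1)} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \sum_{k=1}^{q} w_{ijk}^{(0)} \left(y_{ij} - \mu_k^{(1)}\right)^2}{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \sum_{k=1}^{q} w_{ijk}^{(0)}}$$ (4.30)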


Take the solutions acquired by equation 4.30 as new initial values, and repeat
the previous two steps until a pre-defined accuracy is reached, e.g., the absolute
difference in the likelihood function between two successive iterations is lower than
$10^{-7}$. The final solutions at the end of the EM algorithm are the maximum likelihood
estimates of parameters $\mu_k$ ($k = 1, \ldots, q$) and $\sigma^2$, denoted by $\hat{\mu}_k$ ($k = 1, \ldots, q$) and
$\hat{\sigma}^2$, respectively.
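
The E-step and M-step described above can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the implementation of any QTL mapping software: each
marker group is passed as a pair (conditional frequencies, phenotypes), the conditional
frequencies are treated as known constants, convergence is checked on the mixture log-likelihood
of equation 4.27, and the names `em_mixture`, `tol`, and `max_iter` are hypothetical.

```python
import math


def normal_pdf(y, mu, var):
    """Density of N(mu, var) at y (equation 4.26)."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_mixture(groups, init_means, init_var, tol=1e-7, max_iter=1000):
    """EM estimates of QTL genotype means and the common variance.

    groups: list of (pis, ys); pis are the conditional frequencies of the q QTL
    genotypes in one marker group, ys the phenotypes observed in that group.
    Returns (means, variance, log-likelihood at the last E-step).
    """
    means, var = list(init_means), float(init_var)
    q = len(means)
    n = sum(len(ys) for _, ys in groups)
    old_loglik = None
    loglik = float("-inf")
    for _ in range(max_iter):
        # E-step: posterior probabilities w_ijk (equation 4.28).
        records = []
        loglik = 0.0
        for pis, ys in groups:
            for y in ys:
                parts = [pis[k] * normal_pdf(y, means[k], var) for k in range(q)]
                total = sum(parts)
                loglik += math.log(total)
                records.append((y, [p / total for p in parts]))
        # M-step: weighted means and common variance (equation 4.30).
        sum_w = [sum(w[k] for _, w in records) for k in range(q)]
        means = [sum(w[k] * y for y, w in records) / sum_w[k] for k in range(q)]
        var = sum(w[k] * (y - means[k]) ** 2 for y, w in records for k in range(q)) / n
        if old_loglik is not None and abs(loglik - old_loglik) < tol:
            break
        old_loglik = loglik
    return means, var, loglik


# Toy data (made up for illustration): three marker groups in a DH-like setting.
toy = [
    ((0.99, 0.01), [2.5, 3.0, 3.5, 2.8, 3.2]),
    ((0.01, 0.99), [4.5, 5.0, 5.5, 4.8, 5.2]),
    ((0.70, 0.30), [3.1, 4.9, 2.9]),
]
print(em_mixture(toy, init_means=[3.0, 5.0], init_var=1.0))
```

Because the conditional frequencies are fixed by the marker map and the scanning position, only
the component means and the common variance are updated in each iteration.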

More specifically, the DH population is used below to further illustrate the
implementation of the EM algorithm in QTL mapping. Firstly, DH lines are
classified into four marker groups according to their marker types, i.e., AABB,
AAbb, aaBB, and aabb. Taking AAbb as an example, the DH lines having marker
type AAbb form a group with two QTL genotypes QQ and qq. From
table 4.8, theoretical frequencies of the two QTL genotypes in the AAbb group can be
acquired, which are denoted by $\pi_1$ and $\pi_2$ ($\pi_1 + \pi_2 = 1$). From the point of view of a
population, a proportion $\pi_1$ of the DH lines in this group has genotype QQ, and a
proportion $\pi_2$ has genotype qq. Of course, for a specific DH line, its QTL
genotype is either QQ or qq, which is unknown before QTL mapping. Assuming
the phenotype is $Y = y$ for one DH line, Y is therefore a mixture of two normal
distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ with frequencies $\pi_1$ and $\pi_2$, respectively, i.e.,
equation 4.31.


$$Y \sim \pi_1 N(\mu_1, \sigma^2) + \pi_2 N(\mu_2, \sigma^2)$$ (4.31)


Therefore, given the distributions of two components, the probability density 
function of the observed value Y = y is given in equation 4.32. 


$$f(y \mid \mu_1, \mu_2, \sigma^2) = \pi_1 f(y \mid \mu_1, \sigma^2) + \pi_2 f(y \mid \mu_2, \sigma^2)$$ (4.32)


The density function given by equation 4.32 consists of two parts. The first
part represents the possibility that the observed value y comes from distribution
$N(\mu_1, \sigma^2)$, and the second part represents the possibility that it comes from distribution
$N(\mu_2, \sigma^2)$. The ratios of the two parts to the mixture distribution density are called
posterior probabilities of the sample, represented by $w_1$ and $w_2$, respectively, and
shown in equation 4.33.

$$w_1 = \frac{\pi_1 f(y \mid \mu_1, \sigma^2)}{\pi_1 f(y \mid \mu_1, \sigma^2) + \pi_2 f(y \mid \mu_2, \sigma^2)}, \quad w_2 = \frac{\pi_2 f(y \mid \mu_2, \sigma^2)}{\pi_1 f(y \mid \mu_1, \sigma^2) + \pi_2 f(y \mid \mu_2, \sigma^2)}$$ (4.33)


In theory, DH lines in the AAbb marker group have the proportion $\pi_1$ to be
genotype QQ, and proportion $\pi_2$ to be genotype qq. This does not mean that every
DH line would have the probability $\pi_1$ to be genotype QQ and probability $\pi_2$ to be
genotype qq. It can be seen from equation 4.33 that the posterior probabilities of
each DH line being QQ and qq also depend on the observed value y of the DH line.

Assume again that marker A, locus Q, and marker B are located at 10 cM,
16 cM, and 30 cM, respectively, on the linkage map of one chromosome. The phenotype
of QQ has normal distribution N(3, 1), and the phenotype of qq has a normal
distribution N(5, 1), as shown in figure 4.7. It can be seen from table 4.10 that the
two QTL genotypes, as components, have frequencies $\pi_1 = 0.6989$ and $\pi_2 = 0.3011$
in the AAbb marker group. For one DH line with an observed value Y = 3, its
likelihood function is given as follows,


$$L(Y = 3 \mid \mu_1 = 3, \mu_2 = 5, \sigma^2 = 1) = 0.6989 f(3 \mid 3, 1) + 0.3011 f(3 \mid 5, 1)$$
$$= 0.2788 + 0.0163 = 0.2951$$


Values 0.2951, 0.2788, and 0.0163 in the equation correspond to probability
densities of the mixture distribution, and the two components QQ and qq, respectively
(figure 4.7). For one DH line with Y = 3, posterior probabilities $w_1$ and $w_2$
of belonging to QTL genotypes QQ and qq can be calculated as follows,

$$w_1 = \frac{0.2788}{0.2951} = 0.9449, \quad w_2 = \frac{0.0163}{0.2951} = 0.0551$$ (4.34)

Therefore, the DH line with phenotypic value Y = 3 has the probability
$w_1 = 0.9449$ of being QQ, which is greater than frequency $\pi_1$, and the probability
$w_2 = 0.0551$ of being qq, which is lower than frequency $\pi_2$. The DH line should be
classified into genotype QQ, based on the principle of classification by Bayesian
posterior probabilities. As the observed value increases, the probability of belonging
to genotype QQ gradually decreases, and the probability of belonging to genotype qq
increases (figure 4.7). For DH lines with phenotypic values around 4.4, the two
probabilities are close to 0.5. Beyond this point, the probability of being qq becomes
higher than the probability of being QQ. According to the principle of classification,
those DH lines with phenotypic values above 4.4 should be classified to have the
QTL genotype qq.
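
The numbers in this worked example can be reproduced with a short script. The sketch below
assumes the two-component mixture of equation 4.33 with a common variance; the helper
`equal_posterior_point` is an illustrative name, and its formula follows from setting
$\pi_1 f(y \mid \mu_1, \sigma^2) = \pi_2 f(y \mid \mu_2, \sigma^2)$ and solving for y.

```python
import math


def normal_pdf(y, mu, var):
    """Density of N(mu, var) at y."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior(y, pis, means, var):
    """Posterior probabilities of the mixture components given the observation y."""
    parts = [p * normal_pdf(y, m, var) for p, m in zip(pis, means)]
    total = sum(parts)
    return [part / total for part in parts]


def equal_posterior_point(pi1, pi2, mu1, mu2, var):
    """Phenotypic value at which both posterior probabilities equal 0.5."""
    return 0.5 * (mu1 + mu2) + var * math.log(pi1 / pi2) / (mu2 - mu1)


pis, means, var = (0.6989, 0.3011), (3.0, 5.0), 1.0
print([round(w, 4) for w in posterior(3.0, pis, means, var)])
print(round(equal_posterior_point(pis[0], pis[1], means[0], means[1], var), 3))
```

With these inputs the switching point is slightly above 4.4, in line with figure 4.7.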

Marker types of the individuals or lines composing the mapping population are
known in QTL mapping. Between any two neighboring markers, all lines in the DH
population are firstly classified according to their marker types. It has been seen
from equation 4.33 and figure 4.7 that, given the phenotypic distributions of two
QTL genotypes QQ and qq, posterior probabilities of DH lines belonging to genotypes
QQ and qq can be calculated, and each DH line can be assigned to the one with the
larger posterior probability. However, the distribution parameters of QQ and qq are
unknown and are to be estimated by the iterative EM algorithm. Initial values of
unknown parameters are needed before the iteration starts. It can be seen from
table 4.10 and figure 4.5 that most DH lines with marker type AABB have genotype
QQ, and the sample mean of the AABB group can be used as an initial value for the
phenotypic mean of QQ. Most DH lines with marker type aabb have genotype qq,
and the sample mean of the aabb group can be used as an initial value for the
phenotypic mean of qq. The sample variance of the mapping population can be used
as an initial value for the phenotypic variance of the two QTL distributions.

FIG. 4.7 — Calculation of the posterior probabilities of QTL genotypes in DH population.
Notes: At the current scanning position, locus A (i.e., the left marker), locus Q, and locus B
(i.e., the right marker) are located at 10 cM, 16 cM, and 30 cM on the chromosome,
respectively. QQ has normal distribution N(3, 1), and qq has normal distribution N(5, 1).

Interval mapping scans the whole genome to seek QTL, during
which the relative positions of the current scanning position and the two flanking
markers are known. Equation 4.33 can be used to calculate posterior probabilities for
all samples. For example, at the scanning position shown in table 4.10 and figure 4.7, one
DH line with sample value y = 3 has a posterior probability $w_1 = 0.9449$ of belonging
to genotype QQ and a posterior probability $w_2 = 0.0551$ of belonging to genotype qq.
The DH line is therefore split into two genotypes QQ and qq by posterior probabilities.
That is to say, 0.9449 of the DH line belongs to genotype QQ, and 0.0551 of
the DH line belongs to genotype qq. The likelihood function for the DH line can be
re-written as the product of the contributions of the two QTL genotypes, i.e.,


$$L(\mu_1, \mu_2, \sigma^2 \mid Y = y) = \left[\pi_1 f(y \mid \mu_1, \sigma^2)\right]^{w_1} \left[\pi_2 f(y \mid \mu_2, \sigma^2)\right]^{w_2}$$ (4.35)


Every DH line has a likelihood function similar to equation 4.35, and therefore
the likelihood function on all DH lines can be acquired, as has been given in
equation 4.29. Distribution parameters can be re-calculated by equation 4.30 and
then used as new initial values for the next iteration. The two steps in the EM
algorithm are alternately repeated until the difference in estimated values or likelihoods
between two successive iterations falls below a given accuracy. Obviously, it is much
easier to work out the solutions of the likelihood function given by equation 4.29, in
comparison with the likelihood given by equation 4.27.


4.2.3 Testing for the Existence of QTL


A significant difference between phenotypic means $\mu_k$ ($k = 1, 2, \ldots, q$) from different
QTL genotypes indicates the presence of QTL. Null hypothesis $H_0$ and alternative
hypothesis $H_A$ in the test are given below,


$$H_0: \mu_1 = \cdots = \mu_q = \mu_0, \text{ and}$$
$$H_A: \text{at least two of } \mu_1, \ldots, \mu_q \text{ are not equal to each other}$$ (4.36)


In the previous section, the EM algorithm has been used to calculate the maximum
likelihood estimates of distribution parameters, by which the maximum
likelihood value can also be calculated when $H_A$ is true. When $H_0$ is true, all samples
follow the same distribution, namely,


$$Y_{ij} \sim N(\mu_0, \sigma_0^2)$$ (4.37)


where $i = 1, \ldots, m$, $j = 1, \ldots, n_i$, and $n_i$ is the sample size of the ith marker group.
The likelihood function and logarithm of the likelihood are written as follows,
respectively.

$$L(\mu_0, \sigma_0^2 \mid Y = y) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} f(y_{ij} \mid \mu_0, \sigma_0^2), \text{ and}$$
$$\ln L(\mu_0, \sigma_0^2 \mid Y = y) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \ln f(y_{ij} \mid \mu_0, \sigma_0^2)$$ (4.38)


Maximum likelihood estimates of the mean and variance can be obtained by
equation 4.39.


$$\hat{\mu}_0 = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} y_{ij}}{\sum_{i=1}^{m} n_i}, \quad \hat{\sigma}_0^2 = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \hat{\mu}_0)^2}{\sum_{i=1}^{m} n_i}$$ (4.39)


Substituting the maximum likelihood estimates in equation 4.39 into equation 4.38,
the maximum likelihood under $H_0$ can be obtained and is given in equation 4.40.
Substituting the maximum likelihood estimates from equation 4.30 at the end of the EM
algorithm into equation 4.29, the maximum likelihood under $H_A$ can be
obtained and is given in equation 4.41. The likelihood ratio test (LRT) statistic from the
two hypotheses and its asymptotic distribution are shown in equation 4.42.


$$\max L(H_0) = L(\hat{\mu}_0, \hat{\sigma}_0^2 \mid Y = y)$$ (4.40)


$$\max L(H_A) = L(\hat{\mu}_1, \ldots, \hat{\mu}_q, \hat{\sigma}^2 \mid Y = y)$$ (4.41)

$$LRT = -2 \ln \frac{\max L(H_0)}{\max L(H_A)} \sim \chi^2(df = q - 1)$$ (4.42)

There are q + 1 independent parameters under $H_A$ and two independent
parameters under $H_0$. In the case of large samples, the LRT statistic asymptotically
approaches the $\chi^2$ distribution, and the degree of freedom is equal to the difference in
the number of independent parameters between the two hypotheses, that is,
df = q − 1. Therefore, the significance of the difference in phenotypic means can be
tested. Similar to the test on linkage relationship in chapter 2, the LOD score can
also be used in QTL mapping, which is calculated by equation 4.43. Equation 2.24
gave the relationship between the LRT statistic and the LOD score.


$$LOD = \log_{10} \frac{\max L(H_A)}{\max L(H_0)}$$ (4.43)
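
A minimal sketch of the test follows, assuming the two maximized log-likelihoods are already
available (for instance, from the closed-form estimates under the null hypothesis and from the EM
algorithm under the alternative). The function names are illustrative. The tail probability
helper only covers df = 1 and df = 2, the cases of the DH and F2 populations, for which the
chi-square survival function has a simple closed form.

```python
import math


def null_loglik(ys):
    """Maximized log-likelihood under H0 (equations 4.38 to 4.40)."""
    n = len(ys)
    mu0 = sum(ys) / n
    var0 = sum((y - mu0) ** 2 for y in ys) / n
    # At the maximum likelihood estimates the exponent terms sum to -n/2.
    return -0.5 * n * (math.log(2.0 * math.pi * var0) + 1.0)


def lrt_and_lod(loglik_h0, loglik_ha):
    """LRT statistic (equation 4.42) and LOD score (equation 4.43)."""
    lrt = -2.0 * (loglik_h0 - loglik_ha)
    lod = (loglik_ha - loglik_h0) / math.log(10.0)
    return lrt, lod


def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# Illustrative log-likelihood values, not taken from the barley data.
lrt, lod = lrt_and_lod(loglik_h0=-210.0, loglik_ha=-200.0)
print(round(lrt, 3), round(lod, 3), chi2_sf(lrt, df=1))
```

Both statistics are functions of the same log-likelihood difference, so LRT = 2 ln(10) × LOD,
i.e., about 4.605 times the LOD score.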


4.2.4 Estimation of Genetic Effects of QTL 
and Its Contribution to Phenotypic Variance 


As has been seen previously, at any scanning position, maximum likelihood estimates
of phenotypic means of different QTL genotypes can be calculated by the EM
algorithm. In the DH population, the relationship between phenotypic means $\mu_k$
(k = 1, 2) and the grand mean $\mu$ and QTL additive effect a is given below.


$$\mu_1 = \mu + a, \quad \mu_2 = \mu - a$$


Therefore, the grand mean $\mu$ and QTL additive effect a can be calculated from
the maximum likelihood estimates of two phenotypic means, i.e., equation 4.44.


$$\hat{\mu} = \frac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2), \quad \hat{a} = \frac{1}{2}(\hat{\mu}_1 - \hat{\mu}_2)$$ (4.44)


In the F2 population, the relationship between phenotypic means $\mu_k$ (k = 1, 2, 3)
and the grand mean $\mu$, QTL additive effect a, and QTL dominant effect d is given
below.


$$\mu_1 = \mu + a, \quad \mu_2 = \mu + d, \quad \mu_3 = \mu - a$$


Therefore, the grand mean $\mu$, additive effect a, and dominant effect d can be
estimated from the maximum likelihood estimates of three phenotypic means, i.e.,
equation 4.45.


$$\hat{\mu} = \frac{1}{2}(\hat{\mu}_1 + \hat{\mu}_3), \quad \hat{a} = \frac{1}{2}(\hat{\mu}_1 - \hat{\mu}_3), \quad \hat{d} = \hat{\mu}_2 - \frac{1}{2}(\hat{\mu}_1 + \hat{\mu}_3)$$ (4.45)


According to the properties of maximum likelihood estimation, estimates given
in equations 4.44 and 4.45 are also the maximum likelihood estimates of the corresponding
genetic parameters. These estimates have the same unit as measured on
the phenotypic trait. Genetic effects of QTL can be positive or negative, depending
on the distribution of alleles in two parents. To facilitate the comparison of QTL
identified on different traits and in different populations, it is necessary to use a
relative and unit-less index, which is called the phenotypic variance explained
(PVE) by the QTL or contribution of the QTL ($R^2$). In this book, PVE is mostly
adopted to represent the relative contribution of a QTL to phenotypic variation.
PVE of one QTL is defined in equation 4.46 as the percentage of phenotypic variance
explained by the genetic variance of the QTL.


$$PVE = \frac{V_G}{V_P} \times 100\%$$ (4.46)


where $V_G$ is the genetic variance caused by the QTL, and $V_P$ is the phenotypic
variance of the phenotypic trait. In fact, phenotypic variance $V_P$ is equal to the
estimate of variance under the null hypothesis $H_0$, i.e., equation 4.39. When
segregation distortion is not considered, genetic variances of the QTL in the DH and
F2 populations are given in equation 4.47, respectively.


$$V_{G(DH)} = a^2, \text{ and } V_{G(F_2)} = \frac{1}{2} a^2 + \frac{1}{4} d^2$$ (4.47)

Segregation distortion occurs in most practical mapping populations to various
extents. Let the frequencies of QQ and qq in the DH population be $f_{QQ}$ and $f_{qq}$,
respectively, and the frequencies of QQ, Qq, and qq in the F2 population be $f_{QQ}$, $f_{Qq}$, and
$f_{qq}$, respectively. Population mean and variance can be calculated respectively by
equations 1.11 and 1.13 in chapter 1. Genetic variances caused by the QTL in DH
and F2 populations are given in equation 4.48, respectively.


$$V_{G(DH)} = 4 f_{QQ} f_{qq} a^2, \text{ and}$$
$$V_{G(F_2)} = \left[f_{QQ} + f_{qq} - (f_{QQ} - f_{qq})^2\right] a^2 - 2 f_{Qq} (f_{QQ} - f_{qq}) a d + (f_{Qq} - f_{Qq}^2) d^2$$ (4.48)


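In equation 4.48, frequencies of QTL genotypes can be acquired from the posterior
probabilities at the end of the EM algorithm, namely,


$$\hat{f}_k = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} \hat{w}_{ijk}}{\sum_{i=1}^{m} n_i}, \quad k = 1, \ldots, q$$ (4.49)


where

$$\hat{w}_{ijk} = \frac{\pi_{ik} f(y_{ij} \mid \hat{\mu}_k, \hat{\sigma}^2)}{\sum_{l=1}^{q} \pi_{il} f(y_{ij} \mid \hat{\mu}_l, \hat{\sigma}^2)}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n_i$$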
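
Equations 4.44 to 4.48 translate directly into code. The sketch below is illustrative only: the
function names are hypothetical, and the inputs are the assumed genotype means used earlier in
this section (3, 5, and 5) together with made-up genotype frequencies and a made-up phenotypic
variance, not estimates from real data.

```python
def effects_dh(mu1, mu2):
    """Grand mean and additive effect in a DH population (equation 4.44)."""
    return 0.5 * (mu1 + mu2), 0.5 * (mu1 - mu2)


def effects_f2(mu1, mu2, mu3):
    """Grand mean, additive effect, and dominant effect in F2 (equation 4.45)."""
    mu = 0.5 * (mu1 + mu3)
    return mu, 0.5 * (mu1 - mu3), mu2 - mu


def genetic_variance_dh(a, f_qq_upper, f_qq_lower):
    """Equation 4.48 for DH; the arguments are the frequencies of QQ and qq."""
    return 4.0 * f_qq_upper * f_qq_lower * a ** 2


def genetic_variance_f2(a, d, f_qq_upper, f_het, f_qq_lower):
    """Equation 4.48 for F2; frequencies of QQ, Qq, and qq in that order."""
    diff = f_qq_upper - f_qq_lower
    return ((f_qq_upper + f_qq_lower - diff ** 2) * a ** 2
            - 2.0 * f_het * diff * a * d
            + (f_het - f_het ** 2) * d ** 2)


def pve(v_g, v_p):
    """Phenotypic variance explained, in percent (equation 4.46)."""
    return 100.0 * v_g / v_p


mu, a, d = effects_f2(3.0, 5.0, 5.0)
v_g = genetic_variance_f2(a, d, 0.25, 0.50, 0.25)   # no segregation distortion
print(mu, a, d, v_g, pve(v_g, v_p=1.75))            # v_p = 1.75 is a made-up value
```

Without segregation distortion the general formulas reduce to equation 4.47, which gives a quick
check of an implementation.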


4.2.5 Applications of Simple Interval Mapping in DH
and F2 Populations


In the barley population consisting of 145 DH lines, figure 4.8 gives the LOD score
profile and additive effect profile from one-dimensional scanning on kernel weight.
The scanning step is set at 1 cM. Marker types on chromosome 1H for the 145 DH
lines are given in figure 1.7 in chapter 1, and their kernel weights are given in
table 4.3. When the LOD threshold is set at 2.5, there is one significant peak located
on chromosome 5H and two peaks located on 7H. There are peaks on other
chromosomes, but LOD scores at these peaks are below the threshold value.
Positions of the three significant peaks can be regarded as positions of three QTLs
identified by the simple interval mapping method.


FIG. 4.8 — LOD score profile (A) and additive effect profile (B) from one-dimensional scanning
for kernel weight on seven chromosomes in the barley DH population. Vertical axes: LOD
score (A) and additive effect (B); horizontal axis of both panels: one-dimensional scanning on
the seven chromosomes in barley, step 1 cM.


The major purpose of QTL mapping is to determine the positions of individual 
QTLs on the chromosome and estimate their genetic effects. Table 4.12 shows 
exactly such information for the three QTL on kernel weight identified in the barley 
DH population. The first four columns give information on positions, and the last 
three columns give the information on genetic effects of the identified QTLs. 
Markers most closely linked with QTL can be used in marker-assisted selection in 
breeding, by which the favorable genes on phenotypic traits can be indirectly 
selected by their linked markers. LOD score and PVE are independent of the 
measurement unit on phenotypic trait and direction of QTL effects, and therefore 
are suitable in comparing and summarizing the QTL mapping results more broadly. 


TAB. 4.12 — Three QTL on kernel weight identified by simple interval mapping in one barley
DH population.


Chromosome   Position   Nearest left   Nearest right   LOD     PVE     Additive
             (cM)       marker         marker          score   (%)     effect
5            4          Act8B          MWG502          13.05   36.96   -1.30
7            0          dRpg1          iPgd1A          2.55    8.48    -0.62
7            98         VAtp57A        MWG571D         5.36    17.17   -0.89


The additive effect not only reveals the contribution of one QTL to the phenotypic
trait but also can be used to determine the parental source of the favorable allele. In
one bi-parental genetic population, each identified QTL has two alleles.
Depending on the breeding target and how the phenotypic values are recorded,
higher phenotypic values may be favored for some traits, such as yield, oil content,
protein content, and disease resistance. But for other traits, such as growth period
and lodging, lower phenotypic values may be favored. There are also some traits,
such as plant height and amylose content in rice, for which phenotypic values may be
favored at an intermediate level, i.e., neither too low nor too high. Therefore, which
allele is favorable and which one is non-favorable depends on the breeding target of
the phenotypic trait.

In the DH population, the average kernel weights of parents Harrington and 
TR306 are 38.7 mg and 45.0 mg, respectively. In QTL mapping, code 2 represents 
the genotype of Harrington, i.e., QQ with phenotypic mean $\mu + a$; code 0 
represents the genotype of TR306, i.e., qq with phenotypic mean $\mu - a$. In 
breeding for high yield, higher kernel weight is generally favored. If the additive 
effect of one QTL is positive, the allele carried by Harrington, i.e., the parent coded 
by 2, will increase the kernel weight; the allele carried by TR306, i.e., the parent 
coded by 0, will reduce the kernel weight. On the contrary, if the additive effect is 
negative, the allele in Harrington will reduce the kernel weight; the allele in TR306 
will increase the kernel weight. The three QTLs given in table 4.12 all have negative 
additive effects on kernel weight, indicating that the alleles increasing kernel weight 
come from the parent coded by 0, i.e., TR306 with the higher kernel weight. 
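
The sign rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `increasing_allele_parent` and the list layout are assumptions made here, and the additive effects are the three values listed in table 4.12, keyed by their left flanking markers.

```python
# Hypothetical helper (not part of any mapping software): map the sign of an
# additive effect to the parent carrying the trait-increasing allele.
# Coding follows the text: 2 = Harrington (QQ), 0 = TR306 (qq).
def increasing_allele_parent(additive_effect,
                             parent_coded_2="Harrington",
                             parent_coded_0="TR306"):
    if additive_effect > 0:
        return parent_coded_2   # QQ mean is mu + a, so a > 0 favors code 2
    if additive_effect < 0:
        return parent_coded_0   # a < 0 means the qq class has the higher mean
    return None                 # no additive difference between the alleles

# (left flanking marker, additive effect) for the three QTLs in table 4.12
qtls = [("Act8B", -1.30), ("dRpg1", -0.62), ("VAtp57A", -0.89)]
sources = [increasing_allele_parent(a) for _, a in qtls]

for (marker, a), parent in zip(qtls, sources):
    print(f"QTL near {marker}: a = {a:+.2f}, increasing allele from {parent}")
```

Whether the increasing allele is also the favorable one still depends on the breeding target, as discussed above.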

For the F2 population consisting of 110 individuals, figure 4.9 gives the LOD score and 
genetic effect profiles from one-dimensional scanning. The scanning step is set at 
1 cM. Marker types on the first chromosome for the 110 individuals are given in 
figure 1.8 in chapter 1, and their phenotypic values are given in table 4.6. When the 
LOD threshold is set at 2.5, there are several significant peaks on the first and third 
chromosomes. Table 4.13 shows the positions and genetic effects at the two peaks with 
the highest LOD scores. Dominant effects at the two peaks are relatively small in 
comparison with their additive effects, indicating that the phenotypic trait may be 
controlled by two independent QTLs with additive effects as the major source of 
genetic variation. 


FIG. 4.9 — LOD profile (A) and genetic effect profile (B, additive and dominant effects) 
from one-dimensional scanning of four chromosomes in an F2 population containing 110 
individuals. The scanning step is 1 cM. 

TAB. 4.13 — Two QTLs identified by simple interval mapping in one F2 population. 

Chromosome  Position (cM)  Left marker  Right marker  LOD score  PVE (%)  Additive effect  Dominant effect
1           31             M1-8         M1-9          6.27       20.41    1.03              0.22
3           28             M3-5         M3-6          9.97       30.36    1.22             -0.02


4.2.6 Phenomenon of “Ghost” QTL in Simple Interval 
Mapping 


When Lander and Botstein first proposed simple interval mapping in 1989, they 
assumed that no more than one QTL is located on each chromosome, i.e., linked QTLs 
were not considered. However, this assumption is too strong in genetics. When two 
QTLs are linked on one chromosome and their genetic effects are in the same direction 
(also called coupling linkage), it is hard to see two separate peaks on the LOD profile 
around the true QTL positions. Instead, one significant peak may appear between the 
two QTLs, and therefore the linked QTLs cannot be correctly detected by simple 
interval mapping. The falsely detected QTL between two true QTLs is called a "ghost" 
QTL (Martinez and Curnow, 1992; Wright and Mowers, 1994; Zeng, 1994; Haley and 
Knott, 1992). When the linked QTLs have genetic effects in opposite directions (also 
called repulsive linkage), the LOD score becomes low and no significant peak can be 
observed on the LOD profile. In this case, neither of them can be detected by simple 
interval mapping. 

One simulated DH population is used here as an example to further illustrate the 
problem (figure 4.10). The genome consists of six chromosomes, each 120 cM in 
length. One QTL is located at 35 cM on the first chromosome with an additive effect 
of 1. Two QTLs are linked at 35 cM and 68 cM on the second chromosome, and their 
additive effects are both equal to 1. Two QTLs are linked at 35 and 68 cM on the 
third chromosome with additive effects of 1 and -1, respectively. No QTLs are 
located on the other chromosomes. The broad-sense heritability of the phenotypic trait 
is set to 0.8. There is one and only one QTL on the first chromosome, which represents 
the case of independent inheritance. The two QTLs located on the second 
chromosome have additive effects in the same direction, representing the case of 
coupling-phase linkage. The two QTLs located on the third chromosome have additive 
effects in opposite directions, representing the case of repulsive-phase linkage. 
Marker genotypes and phenotypic values of 200 DH lines are generated by the 
simulation functionality implemented in the QTL IciMapping software. The scanning 
step in simple interval mapping is set at 1 cM. 

On the LOD profile, one peak can be clearly seen on the first chromosome close 
to the true QTL position at 35 cM (figure 4.10A), and the additive effect at the peak 
is close to the true additive effect of 1 (figure 4.10B). This indicates that simple 
interval mapping can be efficient in mapping independent QTLs. Several peaks are 
present on the second chromosome, and the highest one occurs between the two 
QTLs linked in the coupling phase (figure 4.10A). If only one QTL is reported 
from the second chromosome, it will be mapped between the two true 
QTLs. In addition, the estimated additive effect at the highest peak is much 
higher than either of the true effects (figure 4.10B). The LOD score on the third 
chromosome is rather low. No significant peak can be observed, and therefore no 
QTL can be declared (figure 4.10). Genetic linkage between QTLs therefore 
has a large impact on simple interval mapping. In the case of coupling linkage, a 
"ghost" QTL can be detected between the two linked QTLs. In the case of 
repulsive linkage, neither of the linked QTLs can be detected. 

FIG. 4.10 — LOD profile (A) and additive effect profile (B) from simple interval mapping 
when both independent and linked QTLs are included in a simulated population with 200 DH 
lines. Notes: The genome consists of six chromosomes, each 120 cM in length. One QTL is 
located at 35 cM on the first chromosome with an additive effect of 1. Two QTLs are located 
at 35 and 68 cM on the second chromosome, both with an additive effect of 1. Two QTLs are 
located at 35 and 68 cM on the third chromosome, with additive effects of 1 and -1, 
respectively. The broad-sense heritability of the phenotypic trait is equal to 0.8, and the 
scanning step is 1 cM. Scanning positions marked by arrows on the x-axis represent the true 
positions of QTLs. The length of each arrow is proportional to the size of the QTL effect. 
An upward arrow indicates a positive additive effect, and a downward arrow indicates a 
negative additive effect. 
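
Part of this behavior can be reproduced with a simple calculation. The sketch below is not simple interval mapping: it computes the expected additive effect at a single marker in a DH population, assuming Haldane's mapping function without interference, so that a QTL with effect a at distance d cM contributes a(1 - 2r) = a·exp(-2d/100). The function name `expected_marker_effect` is hypothetical. The calculation does not reproduce the position of the "ghost" peak, but it shows why the estimated effect is inflated under coupling linkage and cancels out under repulsive linkage.

```python
import math

def expected_marker_effect(pos_cm, qtls):
    """Expected additive effect observed at a marker at pos_cm in a DH population.

    qtls is a list of (position_cM, additive_effect) on the same chromosome.
    Assumes Haldane's mapping function: 1 - 2r = exp(-2 * d / 100), d in cM.
    """
    return sum(a * math.exp(-2.0 * abs(pos_cm - q) / 100.0) for q, a in qtls)

# QTL configurations taken from the simulated example in the text
independent = [(35.0, 1.0)]                 # chromosome 1
coupling = [(35.0, 1.0), (68.0, 1.0)]       # chromosome 2
repulsive = [(35.0, 1.0), (68.0, -1.0)]     # chromosome 3

for x in (35.0, 51.5, 68.0):                # the two QTL positions and their midpoint
    print(x,
          round(expected_marker_effect(x, independent), 3),
          round(expected_marker_effect(x, coupling), 3),
          round(expected_marker_effect(x, repulsive), 3))
```

Under these assumptions, the expected effect between the two coupled QTLs exceeds either true effect of 1, while at the midpoint of the two repulsive QTLs it is exactly zero.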


4.2.7 Other Problems with Simple Interval Mapping 


While testing for the presence of a QTL in one marker interval, simple interval mapping 
does not have any control over potential QTLs located outside the current interval. 
It has been seen in §4.2.6 that simple interval mapping cannot separate linked 
QTLs properly, let alone estimate their genetic effects. Even when the 
QTLs are located on different chromosomes, the detection power is low, especially for 
QTLs with smaller genetic effects, because of the large sampling variance included in 
the phenotypic distributions of different marker groups. In addition, one QTL can 
influence a wide region of the chromosome on which it is located. 
In chromosomal regions that are far from the QTL, the test statistic can still be high 
and may still exceed the threshold value. This may not matter when there is at 
most one QTL on each chromosome, since the only QTL is most likely to be located 
at the position where the test statistic is maximized. However, the problem with 
practical mapping populations is that the number of QTLs on each chromosome is 
unknown in advance. When two or more QTLs are linked on one chromosome, 
these QTLs interfere with each other during testing, causing bias 
in LOD scores, QTL positions, and QTL effects. Another problem is that only two 
markers are used for the test in each marker interval. 

Take another simulated DH population as an example. Figure 4.11 
gives the LOD score and additive effect profiles from simple interval mapping on the 
simulated population. The genome consists of six chromosomes, each 120 cM in 
length. Five QTLs are located at 35, 35, 68, 35, and 68 cM on the first five 
chromosomes. No QTLs are located on the last chromosome. Four QTLs have additive 
effects equal to 1, and one QTL has an additive effect equal to -1 (figure 4.11). 
Marker genotypes and phenotypic values of 200 DH lines are generated by the QTL 
IciMapping software. The scanning step in simple interval mapping is set at 1 cM. It 
can be seen from figure 4.11 that several peaks are present on each chromosome 
carrying a QTL, but one highest peak can be clearly observed. On both sides of the 
highest peak, the LOD score drops slowly. Therefore, when multiple QTLs are linked 
on one chromosome, the LOD profiles caused by the individual QTLs can easily overlap, 
making the QTLs hard to separate. 


FIG. 4.11 — LOD profile (A) and additive effect profile (B) from simple interval mapping 
when five QTLs are pre-defined and located on five chromosomes in a simulated population 
with 200 DH lines. Notes: One-dimensional scan on six chromosomes, each 120 cM in length, 
with a step size of 1 cM. Five QTLs are located at 35, 35, 68, 35, and 68 cM on the first 
five chromosomes, and their additive effects are equal to 1, 1, 1, 1, and -1, respectively. 
The broad-sense heritability of the phenotypic trait is equal to 0.8. Scanning positions 
marked by arrows on the x-axis represent the true positions of QTLs. The length of each 
arrow is proportional to the size of the QTL effect. An upward arrow indicates a positive 
additive effect, and a downward arrow indicates a negative additive effect. 


4.3 Threshold Values of LOD Score in QTL Mapping 


Because of random errors, two types of errors are involved 
in statistical hypothesis tests. One is to reject the null hypothesis when it is true, 
which is called a false positive or a Type I error. The other is to accept the null 
hypothesis when it is false, which is called a false negative or a Type II error. In QTL 
mapping, if there is no QTL at a chromosomal position, the test statistic (LRT or 
LOD score) at this position may still exceed a critical (or threshold) value because of 
random errors. In this situation, a QTL would be wrongly declared at this 
position and a Type I error occurs. Such a QTL is called a false positive. On the 
other hand, there may indeed be a QTL at a chromosomal position, but the LRT or 
LOD score at this position does not exceed the given critical value. In 
this situation, no QTL would be declared at this position and a Type II error 
occurs. Such an inference is called a false negative. The choice of suitable critical 
values for the test statistics and the probabilities of the two types of errors have 
received extensive attention in theoretical studies on QTL mapping methodology 
(Sun et al., 2013; Li et al., 2010, 2012; Piepho, 2001; van Ooijen, 1999; Benjamini and 
Hochberg, 1995; Rebai et al., 1994). Statistical hypothesis tests focus on the control of 
the Type I error, based on which the critical value of the test statistic can be determined. 
This section introduces the distribution of the test statistic in whole-genome QTL 
mapping under the null hypothesis, and the choice of appropriate LOD thresholds to 
control the Type I error at a pre-defined level. Readers can also refer to Li et al. 
(2010, 2012) and Sun et al. (2013) for more details. Type II error, QTL detection 
power, and comparison of mapping methods will be covered in §5.4 in the next chapter. 


4.3.1 Significance Level and Critical Value of One Test 
Statistic 


To perform a hypothesis test in statistics, one test statistic, denoted by $T$, has to be 
established first on the observed samples. As mentioned earlier, hypothesis tests in 
statistics focus on controlling the probability of a Type I error. When the test 
statistic $T$ follows a known distribution under the condition that the null hypothesis 
is true, two equivalent methods can be used in statistical inference. Firstly, the 
decision to reject or accept the null hypothesis can be based on whether the 
test statistic calculated from the observed samples (the observed statistic for 
short) exceeds the critical value $T_\alpha$ at the given significance level $\alpha$ 
(e.g., $\alpha$ = 0.05, 0.01, or 0.001). Secondly, the decision can be based on whether the 
significance probability $P$ of the observed statistic is lower than the given 
significance level $\alpha$. 

Assume the samples are randomly drawn from normally distributed populations. 
For most tests on population means and variances, the test statistics follow the 
$\chi^2$ distribution, $t$ distribution, or $F$ distribution under the null hypothesis. 
From the distribution of the test statistic, a critical value can be determined and the 
Type I error can be controlled below the given probability level. Given the probability 
of making a Type I error, or equivalently the significance level $\alpha$, the critical 
value $T_\alpha$ satisfies the following equation. 


$$\Pr\{T > T_\alpha\} = \alpha \tag{4.51}$$


In probability and statistics, as long as the distribution of the statistic $T$ is 
known, the critical value $T_\alpha$ at the given significance level $\alpha$ can be 
obtained. For a set of observations, the observed value of the test statistic is 
represented by $T_{obs}$, and the range of the statistic is assumed to be $(0, +\infty)$. 
When $T_{obs} > T_\alpha$, i.e., $T_{obs} \in (T_\alpha, +\infty)$, the null hypothesis 
$H_0$ is rejected. Otherwise, it is accepted. In hypothesis testing, $(T_\alpha, +\infty)$ 
is therefore called the rejection region, and $(0, T_\alpha)$ is called the acceptance region. 

When the above criteria are used in statistical inference, it is guaranteed that 
the probability of a Type I error does not exceed the significance level $\alpha$. In 
other words, the probability of rejecting $H_0$ when it is true does not exceed $\alpha$. 
Depending on the research objectives, the probability of a Type I error can 
be controlled at 5%, 1%, or 0.1%. For example, if a test statistic $T$ follows the 
$\chi^2$ distribution with 10 degrees of freedom (figure 4.12A), the critical value 
$T_\alpha$ is equal to 18.31 at the significance level $\alpha$ = 0.05. For an observed 
statistic $T_{obs}$ = 13.96, the null hypothesis would be accepted since 
$T_{obs} < T_\alpha$. Whenever the rejection decision is made, the probability of a 
Type I error (that is, $H_0$ is true but is still rejected) is below 0.05. 

FIG. 4.12 — Probability density function of the $\chi^2$ distribution with 10 degrees of 
freedom. Notes: (A) The method to determine the critical value $T_\alpha$ of one test 
statistic, given the significance level $\alpha$. The shaded area in the graph is the 
significance level $\alpha$, or the probability of a Type I error. (B) The method to 
determine the significance probability $P$, given the sample statistic $T_{obs}$. The 
shaded area is the significance probability $P$ of the observed test statistic $T_{obs}$. 


In addition, given the observed values, the sample statistic $T_{obs}$ can be calculated. 
Under the assumption that the null hypothesis $H_0$ is true, the observed 
statistic $T_{obs}$ follows the same statistical distribution as the test statistic $T$. 
The significance probability $P$ of the observed statistic $T_{obs}$ can be calculated 
by equation 4.52. 


$$P = \Pr\{T > T_{obs}\} \tag{4.52}$$


Therefore, the statistical inference can also be made by comparing 
the significance probability $P$ with the significance level $\alpha$. When $P < \alpha$, 
the null hypothesis $H_0$ is rejected; otherwise, the null hypothesis is accepted. 
Using the significance probability in statistical inference also ensures that the 
probability of a Type I error is no greater than the significance level $\alpha$. For 
one test statistic that follows the $\chi^2$ distribution with 10 degrees of freedom 
(figure 4.12B), when the observed sample statistic $T_{obs}$ is equal to 13.96, the 
significance probability $P$ is equal to 0.1748, which is greater than 0.05. Therefore, the 
null hypothesis would be accepted. Statistical 
inference based on the significance probability is equivalent to inference based on the 
critical value of the test statistic. 
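
The two equivalent routes can be checked numerically. The sketch below uses only the Python standard library; the helper names `chi2_sf` and `chi2_critical` are assumptions made here, the tail probability is computed with the standard recurrence for integer degrees of freedom, and the critical value is found by bisection rather than by a library quantile function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability Pr{T > x} for a chi-square variable with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 2 (exponential tail) or df = 1 (normal tail).
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Recurrence Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q

def chi2_critical(alpha, df, upper=1000.0):
    """Critical value T_alpha with Pr{T > T_alpha} = alpha, found by bisection."""
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_obs, alpha, df = 13.96, 0.05, 10
t_alpha = chi2_critical(alpha, df)      # critical-value route (equation 4.51)
p_value = chi2_sf(t_obs, df)            # significance-probability route (equation 4.52)
reject_by_critical_value = t_obs > t_alpha
reject_by_p_value = p_value < alpha
print(round(t_alpha, 2), round(p_value, 4), reject_by_critical_value, reject_by_p_value)
```

Both routes lead to the same decision for the example in the text, since the tail probability is a decreasing function of the statistic.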


4.3.2 Distribution of the LRT Statistic at Single Scanning 
Positions in the Absence of Any QTL 


The prerequisite for conducting a hypothesis test is to know the distribution of the test 
statistic under the null hypothesis. Only then can the probability of 
a Type I error be controlled. When there is no genetic variation, the LRT 
statistic in QTL mapping at individual scanning positions follows the $\chi^2$ 
distribution, and its degrees of freedom depend on the type of the mapping population and 
the number of parameters to be estimated. Figure 4.13 shows the frequency 
distributions of LRT observed from a large number of scanning positions in DH and F2 
populations under the condition that no QTL is present in the mapping population. 
The line graph indicates the theoretical probability density of the $\chi^2$ 
distribution. It can be seen that the distributions of the LRT statistic are highly 
consistent with the theoretical $\chi^2$ distribution. 

FIG. 4.13 — Theoretical probability density (line) and the observed frequency (bar) 
distributions of the LRT statistic at single scanning positions in the DH (A) and F2 (B) 
populations without the presence of any QTL. Notes: The bar graph is the frequency 
distribution of 7260 LRT statistic values, and the line graph is the probability density 
function of the $\chi^2$ distribution. (A) DH population; the $\chi^2$ distribution has 
1 degree of freedom. (B) F2 population; the $\chi^2$ distribution has 2 degrees of freedom. 


LRT is a widely used method in hypothesis testing with many good statistical 
properties (Stuart et al., 1999; Stuart and Ord, 1994). However, the LOD score has been 
widely accepted and used as a test statistic in genetic studies. In fact, the LRT and LOD 
score are equivalent, since LRT = 2ln(10)*LOD, i.e., about 4.61 times the LOD score. 
Table 4.14 shows the equivalence between some selected LRT values and their 
corresponding LOD scores. 

TAB. 4.14 — LRT value, the corresponding LOD score, and significance probability at four 
different degrees of freedom. 

LOD score  LRT value  Significance probability by degrees of freedom
                      1         2         3         4
0.50        2.30      0.129159  0.316228  0.512026  0.680298
1.00        4.61      0.031876  0.100000  0.203099  0.330259
1.50        6.91      0.008582  0.031623  0.074897  0.140844
2.00        9.21      0.002407  0.010000  0.026621  0.056052
2.50       11.51      0.000691  0.003162  0.009252  0.021366
3.00       13.82      0.000202  0.001000  0.003167  0.007908
3.50       16.12      0.000060  0.000316  0.001072  0.002865
4.00       18.42      0.000018  0.000100  0.000360  0.001021
4.50       20.72      0.000005  0.000032  0.000120  0.000359
5.00       23.03      0.000002  0.000010  0.000040  0.000125


The corresponding significance probabilities are also given in the table for four 
different degrees of freedom. For example, if the sample LOD score is equal to 2.0, 
the corresponding LRT is about 9.21. The corresponding significance probabilities 
are about 0.0024, 0.01, 0.0266, and 0.0561 at 1, 2, 3, and 4 degrees of freedom, 
respectively. Conversely, at the significance level $\alpha$ = 0.05, the critical LOD 
scores are about 1.0 and 1.5 at 1 and 2 degrees of freedom, lie between 1.5 and 2.0 at 
3 degrees of freedom, and are slightly above 2.0 at 4 degrees of freedom (a LOD score 
of 2.0 gives $P$ = 0.0561). With these critical values, the probability of a Type I 
error is below 0.05 in one significance test. 
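
Entries of table 4.14 can be recomputed from the relationship LRT = 2ln(10)*LOD and the upper-tail probability of the $\chi^2$ distribution. The following is a minimal standard-library sketch; the function names `lod_to_lrt` and `chi2_sf` are assumptions made for illustration.

```python
import math

def lod_to_lrt(lod):
    """Convert a LOD score to the equivalent LRT value: LRT = 2 * ln(10) * LOD."""
    return 2.0 * math.log(10.0) * lod

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q

# Recompute a few rows of the table: LOD, LRT, and P at 1 to 4 degrees of freedom.
for lod in (1.0, 2.0, 3.0):
    lrt = lod_to_lrt(lod)
    probs = [chi2_sf(lrt, df) for df in (1, 2, 3, 4)]
    print(f"{lod:.2f}  {lrt:6.2f}  " + "  ".join(f"{p:.6f}" for p in probs))
```

At 2 degrees of freedom the tail probability reduces to exp(-LRT/2), which equals 10 raised to the power of -LOD; this is why the corresponding column of the table contains exact powers of ten at integer LOD scores.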


4.3.3 Factors Affecting the Distribution 
of the Genome-Wide Largest LOD Score 


As far as a single hypothesis test is concerned, it can be seen from table 4.14 that 
the LOD score thresholds 1.0 and 1.5 keep the probability of a Type I 
error below 0.032 at one and two degrees of freedom, respectively. 
In interval-based QTL mapping, hypothesis tests are conducted many times during 
the whole-genome scan. For example, if the genome size is 720 cM and the 
scanning step is 1 cM, the total number of tests is 720. If the 
scanning step is 0.5 cM, the total number of tests is 1440. In QTL 
mapping, the probability of a Type I error is expected to be controlled 
at the whole-genome level (the genome-wide Type I error), such as below 5% or even 
1%. Let $\alpha$ be the significance level in one single test, and $\alpha_g$ be the 
genome-wide significance level. When the genome-wide tests are independent of each 
other, and the number of independent tests is represented by $k$, the relationship 
between $\alpha$ and $\alpha_g$ is given by equation 4.53. The left side of equation 
4.53 is the probability of not making any Type I error in $k$ independent tests, 
which is equal to the product of the corresponding probabilities in the $k$ individual 
tests, i.e., the right side of the equation. 


$$1 - \alpha_g = (1 - \alpha)^k \tag{4.53}$$


When $\alpha$ is very small, the right side of equation 4.53 can be approximated by 
$1 - k\alpha$. Therefore, the approximate relationship between $\alpha$ and $\alpha_g$ 
is given by equation 4.54. 


$$\alpha \approx \alpha_g / k \tag{4.54}$$


Equation 4.54 is called the Bonferroni adjustment (Doerge and Rebai, 1996; 
Benjamini and Hochberg, 1995), which can be used to avoid the accumulation of 
Type I errors during multiple tests. Given the cumulative probability of Type I error 
specified in advance, the Bonferroni adjustment gives the significance level to be 
used in each test. For example, if the cumulative 
probability of Type I error is to be controlled below $\alpha_g$ = 0.01 over 10 
independent tests, the probability of Type I error should be controlled below 
$\alpha$ = 0.001 in each test. When the LRT statistic follows the $\chi^2$ distribution 
with 2 degrees of freedom, the significance level $\alpha$ = 0.001 corresponds to a 
critical LOD score of 3.0 (see table 4.14) in each of the 
10 tests. By doing this, the cumulative probability of Type I error is guaranteed 
to be below $\alpha_g$ = 0.01. 
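
The arithmetic of this example can be verified with a short sketch. The function names are hypothetical. The exact per-test level solves equation 4.53, and the Bonferroni level follows equation 4.54. The LOD conversion is specific to 2 degrees of freedom, where the tail probability of the LRT equals exp(-LRT/2), so that the critical LOD score is simply the negative base-10 logarithm of the per-test level.

```python
import math

def per_test_alpha_exact(alpha_g, k):
    """Solve 1 - alpha_g = (1 - alpha)^k for alpha (equation 4.53)."""
    return 1.0 - (1.0 - alpha_g) ** (1.0 / k)

def per_test_alpha_bonferroni(alpha_g, k):
    """Bonferroni approximation alpha = alpha_g / k (equation 4.54)."""
    return alpha_g / k

def lod_threshold_df2(alpha):
    """Critical LOD score at 2 degrees of freedom: Pr{LRT > x} = exp(-x/2)."""
    return -math.log10(alpha)

alpha_g, k = 0.01, 10
alpha_exact = per_test_alpha_exact(alpha_g, k)
alpha_bonf = per_test_alpha_bonferroni(alpha_g, k)
lod_bonf = lod_threshold_df2(alpha_bonf)
print(alpha_exact, alpha_bonf, lod_bonf)
```

The Bonferroni level is slightly smaller than the exact level, so it is the more conservative of the two.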

QTL mapping involves a large number of hypothesis tests during genome-wide 
scanning. Tests on different chromosomes may be regarded as independent, but 
tests on the same chromosome are not independent because of genetic linkage. This 
is actually one major reason for the presence of "ghost" QTLs when two QTLs are 
linked (figure 4.10). It can be imagined that the shorter the chromosome, the 
stronger the dependence among the tests. If the relationship between 
chromosomal length and the number of independent tests (or the number of effective 
tests) can be identified, equation 4.54 can still be used to determine the significance 
level to be used in each test, and then the critical LOD score can be 
determined accordingly from the distribution of the LRT statistic. Using this critical 
value in each test, the genome-wide probability of Type I error can be controlled. 
It is difficult to derive the effective number of tests in whole-genome 
scanning theoretically, so a simulation approach has to be adopted. 

Figure 4.14 shows the LOD score profiles from whole-genome scanning in five 
simulated DH populations. The scanning step is set at 1 cM, the size of each 
simulated population is equal to 200, and the genome is composed of 6 chromosomes, 
each 120 cM in length. No QTLs are assumed in the simulated populations, so the 
phenotypic variance is equal to the random error variance. Therefore, whenever the 
LOD score at any scanning position exceeds the threshold, a Type I error occurs. If 
the LOD threshold is set at 2.0, one significant peak can be seen in the fourth DH 
population (figure 4.14). There are peaks at various positions in the genome in the 
other four populations, but the LOD scores at these peaks are below 2.0. Therefore, 
a Type I error occurs in one of the five simulations, i.e., the frequency of 
Type I error is equal to 20%. 

FIG. 4.14 — Simple interval mapping in five simulated DH populations without the presence 
of any QTL. Notes: Each population contains 200 DH lines. The genome is composed of six 
chromosomes, each 120 cM in length. The scanning step is set at 1 cM. 


Figure 4.15 shows the LOD profiles in five simulated F2 populations. The 
scanning step, population size, and genome information are the same as those in 
figure 4.14. Phenotypic variation is completely caused by random error. If the LOD 
score threshold is also set at 2.0, two significant peaks can be seen in the third 
population and one significant peak can be seen in the fourth population 
(figure 4.15). No significant peaks are present in the other three populations. 
Therefore, a Type I error occurs in two of the five simulated populations, i.e., the 
frequency of Type I error is equal to 40%. By simulating a large number of 
QTL-free mapping populations as shown in figures 4.14 and 4.15, the distribution of 
the genome-wide maximum LOD score and the probability of the genome-wide Type I 
error can be quantified. 
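
The simulation idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce figures 4.14 and 4.15: it simulates QTL-free DH populations with Haldane's mapping function, and it replaces interval mapping by single-marker regression at evenly spaced markers, for which the LOD score is -(n/2)·log10(1 - r²). Population size, marker spacing, number of replicates, and all function names are arbitrary choices made here to keep the run short.

```python
import math
import random

def simulate_chromosome(n_markers, spacing_cm, rng):
    """Marker genotypes (+1/-1) of one DH line on one chromosome (no interference)."""
    r = 0.5 * (1.0 - math.exp(-2.0 * spacing_cm / 100.0))   # Haldane mapping function
    geno = [rng.choice((-1, 1))]
    for _ in range(n_markers - 1):
        geno.append(-geno[-1] if rng.random() < r else geno[-1])
    return geno

def marker_lod(geno, pheno):
    """LOD score of a single-marker regression: -(n/2) * log10(1 - r^2)."""
    n = len(pheno)
    mg, mp = sum(geno) / n, sum(pheno) / n
    sgg = sum((g - mg) ** 2 for g in geno)
    spp = sum((p - mp) ** 2 for p in pheno)
    sgp = sum((g - mg) * (p - mp) for g, p in zip(geno, pheno))
    if sgg == 0.0 or spp == 0.0:
        return 0.0
    r2 = min(sgp * sgp / (sgg * spp), 1.0 - 1e-12)
    return -0.5 * n * math.log10(1.0 - r2)

def max_lod_null(n_lines, n_chrom, n_markers, spacing_cm, rng):
    """Genome-wide maximum LOD in one simulated population without any QTL."""
    pheno = [rng.gauss(0.0, 1.0) for _ in range(n_lines)]    # pure random error
    best = 0.0
    for _ in range(n_chrom):
        lines = [simulate_chromosome(n_markers, spacing_cm, rng) for _ in range(n_lines)]
        for m in range(n_markers):
            best = max(best, marker_lod([line[m] for line in lines], pheno))
    return best

rng = random.Random(2024)
# 6 chromosomes of 120 cM with a marker every 10 cM, 100 DH lines, 100 replicates
max_lods = sorted(max_lod_null(100, 6, 13, 10.0, rng) for _ in range(100))
threshold_95 = max_lods[int(0.95 * len(max_lods))]   # empirical threshold near alpha_g = 0.05
type1_at_lod2 = sum(lod > 2.0 for lod in max_lods) / len(max_lods)
print(round(threshold_95, 2), type1_at_lod2)
```

With more replicates, the sorted maximum LOD scores give the empirical cumulative distribution from which a genome-wide threshold can be read at any chosen level.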

As can be imagined, many factors may influence the genome-wide maximum 
LOD score. By intuition, if the distribution of the maximum LOD score remains 
unchanged at different levels of one factor, such a factor may be less important and 
its influence may be ignored. On the contrary, if different distributions are observed 
at different levels of one factor, influence from the factor should be considered in the 
choice of suitable thresholds in the genome-wide QTL mapping. 

As the tests on different chromosomes can be regarded as independent, only 
one chromosome is considered in the following simulation studies. Figure 4.16 shows 
the distributions of the chromosome-wide maximum LOD score at three levels of each 
of four factors, i.e., chromosomal length, marker density, type of population, and 
population size (see also Sun et al., 2013). It can be concluded from figure 4.16 that 
population size has no obvious effect on the distribution of the maximum LOD score, 
while chromosomal length, marker density, and type of population do have an 
impact on the distribution. Therefore, when determining the effective number of 
tests, only these three factors are considered. 

FIG. 4.15 — Simple interval mapping in five simulated F2 populations without the presence of 
any QTL. Notes: Each population contains 200 individuals, and the genome is composed of 6 
chromosomes, each with a length of 120 cM. The scanning step size is 1 cM. 

FIG. 4.16 — The empirical cumulative distribution of genome-wide maximum LOD score at 
three chromosomal lengths, three marker densities, three types of population, and three 
population sizes. 


4.3.4 Number of Effective Tests and the Empirical LOD 
Score Thresholds in QTL Mapping 


In the simulation study reported by Sun et al. (2013), chromosomal length was set at 
six levels, i.e., 50, 80, 110, 140, 170, and 200 cM. Marker density was set at five 
levels, i.e., 1, 2, 5, 10, and 20 cM. Three of the most commonly used types of 
populations were simulated, i.e., DH, RIL, and F2. Therefore, a total of 90 scenarios 
were simulated. Each scenario was repeated 10 000 times, so that 
10 000 sample data points were acquired for the chromosome-wide largest LOD 
score and then used to fit the empirical distribution of the largest LOD score. It is 
worth mentioning that the maximum LRT no longer follows any $\chi^2$ distribution. As 
a matter of fact, its exact distribution is hardly known in theory but can be 
investigated by simulation. 

For a given genome-wide significance level $\alpha_g$, the critical value of the 
genome-wide LOD score, denoted as $ThLOD_g$, can be determined from the simulated 
empirical distribution of the maximum LOD score (i.e., $LOD_g$), as has been 
shown in figure 4.16. The relationship is given in equation 4.55. 


$$\Pr\{LOD_g > ThLOD_g\} = \alpha_g \tag{4.55}$$


Based on the genome-wide threshold $ThLOD_g$ calculated by equation 4.55, the 
significance level $\alpha$ at each scanning position can be acquired from the 
$\chi^2$ distribution followed by the LRT statistic (or 2ln(10)*LOD equivalently). 
The relationship is given in equation 4.56. 


$$\alpha = \Pr\{LOD > ThLOD_g\} \tag{4.56}$$


In fact, the LRT statistic asymptotically follows the $\chi^2$ distribution with 1 degree 
of freedom in additive QTL mapping in populations with two genotypes, such 
as DH and RIL, and with 2 degrees of freedom in additive and dominant QTL 
mapping in populations with three genotypes, such as F2 and F3. Given the 
genome-wide significance level $\alpha_g$, the empirical significance level $\alpha$ at 
each scanning position can be determined by equation 4.56 first, and then the effective 
number of tests $k$ in the whole-genome scan can be estimated by equation 4.54. 
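
The two steps can be sketched as below, assuming the asymptotic $\chi^2$ distribution with 1 or 2 degrees of freedom for the single-position LRT. The function names and the example threshold of 2.5 are hypothetical and are not taken from Sun et al. (2013).

```python
import math

def lrt_sf(lrt, df):
    """Tail probability of the single-position LRT for df = 1 (DH, RIL) or df = 2 (F2)."""
    if df == 1:
        return math.erfc(math.sqrt(lrt / 2.0))
    if df == 2:
        return math.exp(-lrt / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")

def effective_tests(th_lod_g, alpha_g, df):
    """Effective number of tests k implied by a genome-wide LOD threshold."""
    lrt = 2.0 * math.log(10.0) * th_lod_g      # LOD threshold on the LRT scale
    alpha = lrt_sf(lrt, df)                    # equation 4.56
    return alpha_g / alpha                     # equation 4.54 rearranged

# Hypothetical input: an empirical threshold of 2.5 at alpha_g = 0.05 in a DH population
k_dh = effective_tests(2.5, 0.05, df=1)
# Consistency check with the earlier example: LOD 3.0, alpha_g = 0.01, df = 2
k_check = effective_tests(3.0, 0.01, df=2)
print(round(k_dh, 1), round(k_check, 1))
```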

Figure 4.17 shows the number of independent tests against chromosomal length 
at five marker densities in three types of populations. An obvious linear 
relationship can be seen between the effective number of tests and chromosomal length, 
and the fitted line is also indicated in the figure. The longer the chromosome, the 
higher the number of independent tests. However, the slopes of the fitted lines differ 
with the type of mapping population, marker density, and genome-wide significance 
level ($\alpha_g$). The denser the markers and the more stringent the genome-wide 
significance level (i.e., the smaller $\alpha_g$), the greater the slope of the fitted 
line. In DH populations, when $\alpha_g$ = 0.05, the slope is the largest (i.e., 0.153) 
when the marker density is 1 cM, and the smallest (i.e., 0.054) when the marker density 
is 20 cM. Under the same marker density, the slope is larger for $\alpha_g$ = 0.01 
than for $\alpha_g$ = 0.05. For different types of populations, the slopes in DH 
populations are smaller than those in RIL populations. The slopes in F2 populations 
are close or even equal to those in RIL populations in some cases. If the reciprocal 
of the regression coefficient shown in figure 4.17 is regarded as the chromosomal 
length equivalent to one independent test, a chromosomal length of about 10 cM is 
equivalent to one independent test in DH populations, and a length of about 6 cM is 
equivalent to one independent test in RIL and F2 populations. 

FIG. 4.17 — Linear relationship between the number of independent tests and chromosomal 
length for five marker densities (i.e., MD = 1, 2, 5, 10, and 20 cM) in one-dimensional 
scanning in DH, RIL, and F2 populations at two genome-wide significance levels (i.e., 0.05 
and 0.01). 


In practice, when the type of population, marker density, and genome size are 
known, the number of independent tests, i.e., $k$, can be estimated for the mapping 
population according to the empirical equations shown in figure 4.17. Then, 
corresponding to the genome-wide level $\alpha_g$ = 0.05 or 0.01, the significance 
level $\alpha$ at a single scanning position can be determined by the Bonferroni 
adjustment (i.e., equation 4.54). Finally, according to the $\chi^2$ distribution 
followed by the single-test LRT statistic in different types of populations, the 
critical value of the LRT statistic can be calculated by equation 4.57, from which the 
LOD score threshold can be acquired. 


$$\alpha = \Pr\{LRT > ThLRT\} \tag{4.57}$$
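
The three steps (effective number of tests, Bonferroni adjustment, critical LRT and LOD) can be sketched as follows. The effective number of tests is treated as a given input, because the fitted equations of figure 4.17 are not reproduced here. The function names are hypothetical, and the critical LRT is found by bisection on the $\chi^2$ tail probability with 1 or 2 degrees of freedom.

```python
import math

def lrt_sf(lrt, df):
    """Tail probability of the single-test LRT for df = 1 (DH, RIL) or df = 2 (F2)."""
    if df == 1:
        return math.erfc(math.sqrt(lrt / 2.0))
    if df == 2:
        return math.exp(-lrt / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")

def lod_threshold(k, alpha_g, df, upper=1000.0):
    """LOD threshold for k effective tests at genome-wide level alpha_g."""
    alpha = alpha_g / k                        # Bonferroni adjustment, equation 4.54
    lo, hi = 0.0, upper
    for _ in range(200):                       # solve equation 4.57 for ThLRT
        mid = 0.5 * (lo + hi)
        if lrt_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    th_lrt = 0.5 * (lo + hi)
    return th_lrt / (2.0 * math.log(10.0))     # convert ThLRT to a LOD threshold

# Hypothetical inputs: 100 effective tests at alpha_g = 0.05
th_df1 = lod_threshold(100, 0.05, df=1)        # two-genotype populations (DH, RIL)
th_df2 = lod_threshold(100, 0.05, df=2)        # three-genotype populations (F2)
th_check = lod_threshold(10, 0.01, df=2)       # the earlier 10-test example
print(round(th_df1, 2), round(th_df2, 2), round(th_check, 2))
```

For the same number of effective tests, the threshold is higher at 2 degrees of freedom than at 1 degree of freedom.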


Tables 4.15-4.17 give the LOD score thresholds for the three representative 
populations DH, RIL, and F2, respectively, by considering various genome sizes, 
three marker densities (i.e., 1, 5, and 20 cM), and two genome-wide significance 
levels (i.e., $\alpha_g$ = 0.05 and 0.01) in one-dimensional QTL mapping. It can be 
seen that higher LOD thresholds are required for larger genomes, denser markers, 
and a more stringent level of genome-wide significance. For a fixed genome size, LOD 
score thresholds in RIL populations are higher than the thresholds in DH 
populations but lower than the thresholds in F2 populations. Although the slopes of 
the fitted lines shown in figure 4.17 are similar in RIL and F2 populations, the 
degrees of freedom of the LRT statistic are different, i.e., 1 in RIL populations and 
2 in F2 populations. Therefore, the LOD score thresholds in F2 populations are 
higher than those in RIL populations. 

It can be seen from tables 4.15-4.17 that when $\alpha_g = 0.05$, the LOD score
thresholds range over 1.83-3.37, 1.92-3.62, and 2.58-4.30 for DH, RIL, and F2
populations, respectively, for marker densities of 1-20 cM and
genome sizes of 250-4000 cM. When $\alpha_g = 0.01$, the LOD score thresholds range
over 2.64-4.16, 2.63-4.36, and 3.35-5.09 for the three types of populations. In
practice, a suitable empirical LOD score threshold can be chosen from
tables 4.15-4.17 based on the type of mapping population, the total length
of the linkage map, the average distance between adjacent markers, and
other factors.


TAB. 4.15 - Empirical LOD thresholds at two levels of genome-wide significance in three
types of population and various genome sizes determined by the constructed linkage map.


Genome size (cM)     $\alpha_g = 0.05$          $\alpha_g = 0.01$
                     DH     RIL    F2          DH     RIL    F2

50                   1.61   1.84   2.40        2.37   2.56   3.18
75                   1.77   2.01   2.57        2.53   2.73   3.36
100                  1.88   2.12   2.70        2.65   2.84   3.49
150                  2.04   2.28   2.87        2.81   3.01   3.66
200                  2.16   2.40   3.00        2.93   3.13   3.79
250                  2.24   2.49   3.10        3.02   3.22   3.88
300                  2.32   2.56   3.17        3.10   3.29   3.96
500                  2.52   2.77   3.40        3.31   3.50   4.18
1000                 2.80   3.05   3.70        3.59   3.79   4.49
1500                 2.97   3.22   3.87        3.76   3.95   4.66
2000                 3.09   3.33   4.00        3.88   4.07   4.79
3000                 3.25   3.50   4.17        4.04   4.24   4.96
4000                 3.37   3.62   4.30        4.16   4.36   5.09


Note: Marker density is equal to 1 cM; markers are assumed to be evenly distributed.


TAB. 4.16 - Empirical LOD thresholds at two levels of genome-wide significance in three
types of population and various genome sizes estimated by the constructed linkage map.


Genome size (cM)     $\alpha_g = 0.05$          $\alpha_g = 0.01$
                     DH     RIL    F2          DH     RIL    F2

50                   1.44   1.59   2.16        2.12   2.31   2.98
75                   1.59   1.75   2.34        2.28   2.48   3.15
100                  1.71   1.86   2.46        2.40   2.59   3.28
150                  1.87   2.02   2.64        2.56   2.76   3.46
200                  1.98   2.14   2.77        2.68   2.87   3.58
250                  2.07   2.22   2.86        2.77   2.96   3.68
300                  2.14   2.30   2.94        2.84   3.04   3.76
500                  2.35   2.50   3.16        3.05   3.25   3.98
1000                 2.63   2.78   3.46        3.34   3.53   4.28
1500                 2.79   2.95   3.64        3.50   3.70   4.46
2000                 2.91   3.07   3.77        3.62   3.82   4.58
3000                 3.07   3.23   3.94        3.79   3.99   4.76
4000                 3.19   3.35   4.07        3.91   4.10   4.88


Note: Marker density is equal to 5 cM; markers are assumed to be evenly distributed. 


TAB. 4.17 - Empirical LOD thresholds at two levels of genome-wide significance in three
types of population and various genome sizes estimated by the constructed linkage map.


Genome size (cM)     $\alpha_g = 0.05$          $\alpha_g = 0.01$
                     DH     RIL    F2          DH     RIL    F2

50                   1.21   1.29   1.88        2.00   1.98   2.65
75                   1.36   1.45   2.06        2.16   2.14   2.83
100                  1.47   1.56   2.18        2.27   2.26   2.95
150                  1.63   1.72   2.36        2.44   2.42   3.13
200                  1.74   1.83   2.48        2.55   2.54   3.25
250                  1.83   1.92   2.58        2.64   2.63   3.35
300                  1.90   2.00   2.66        2.72   2.70   3.43
500                  2.11   2.20   2.88        2.92   2.91   3.65
1000                 2.38   2.48   3.18        3.21   3.19   3.95
1500                 2.55   2.64   3.36        3.37   3.36   4.13
2000                 2.66   2.76   3.48        3.49   3.47   4.25
3000                 2.83   2.92   3.66        3.66   3.64   4.43
4000                 2.95   3.04   3.78        3.78   3.76   4.55


Note: Marker density is equal to 20 cM; markers are assumed to be evenly distributed. 


4.3.5 Permutation Test and the Empirical LOD Score 
Thresholds in QTL Mapping 


The permutation test, sometimes called the randomization test, is a non-parametric
method in statistics. Non-parametric methods are suitable when the population
distribution from which the samples are drawn is poorly known, when the observed
samples do not follow any distribution with an explicit mathematical expression,
or when the distribution of the test statistic is hard to derive. In the
permutation test, the observed samples are randomly rearranged to mimic
the scenario in which the null hypothesis is true, and an empirical distribution
of the test statistic is thereby obtained. The permutation test has
two major advantages. Firstly, the method is conceptually simple.
Secondly, the test can be conducted even when neither the distribution of the test
statistic nor the population that generated the samples is known (Stuart et al., 1999;
Doerge and Churchill, 1996; Churchill and Doerge, 1994). Generally speaking, however,
non-parametric methods are less powerful and less efficient than parametric methods
when the population distributions are known.

Before moving on to the application of the permutation test in QTL mapping,
the significance test between two groups of samples is used here to illustrate
the parametric and non-parametric methods in statistics. Assume $\mu_X$ and $\mu_Y$ are
the population means of distributions X and Y, respectively, and the null hypothesis to
be tested is $H_0: \mu_X = \mu_Y$. The two groups of samples are represented by
$X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, with sample means $\bar{X}$ and $\bar{Y}$
and sample variances $S_X^2$ and $S_Y^2$, respectively. If the two groups of samples
are known to be randomly drawn from two normal distributions having an equal and
known variance $\sigma^2$, a test statistic following the standard normal distribution
when $\mu_X = \mu_Y$, i.e.,

$$Z = \frac{\bar{X} - \bar{Y}}{\sqrt{2\sigma^2 / n}} \sim N(0, 1),$$

can be built and used to test whether the two normal distributions
have an equal population mean. If the two groups of samples come from normal
populations having an equal but unknown variance $\sigma^2$, a test statistic following
the $t$ distribution when $\mu_X = \mu_Y$, i.e.,

$$t = \frac{\bar{X} - \bar{Y}}{\sqrt{(S_X^2 + S_Y^2) / n}} \sim t(2n - 2),$$

can be used to test whether $\mu_X$ and $\mu_Y$ are equal.
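As a rough illustration of the two parametric statistics for equal-size samples, the following Python sketch computes the Z statistic (known common variance) and the pooled t statistic (unknown common variance). The function names and the sample values are hypothetical and serve only to exercise the formulas.

```python
import math
from statistics import NormalDist, mean, variance


def z_statistic(x, y, sigma2):
    """Z statistic for two equal-size samples with known common variance."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(x)
    return (mean(x) - mean(y)) / math.sqrt(2.0 * sigma2 / n)


def t_statistic(x, y):
    """Pooled t statistic and its degrees of freedom for equal-size samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(x)
    # With equal sample sizes the pooled variance is the average of the two
    # sample variances, so the squared standard error is (Sx^2 + Sy^2) / n.
    se = math.sqrt((variance(x) + variance(y)) / n)
    return (mean(x) - mean(y)) / se, 2 * n - 2


def two_sided_p_from_z(z):
    """Two-sided significance probability of a Z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical data, used only to exercise the functions.
    x = [5.1, 4.8, 6.0, 5.5]
    y = [4.2, 4.9, 4.4, 5.0]
    z = z_statistic(x, y, sigma2=0.25)
    print(z, two_sided_p_from_z(z))
    print(t_statistic(x, y))
```

The significance probability of the t statistic requires the t distribution, which is not in the Python standard library, so the sketch stops at the statistic and its degrees of freedom.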

The two tests mentioned above are called parametric methods in statistics.
When knowledge of the two distributions is limited, and the distributions of the
samples and the test statistic are unknown, the parametric methods can no longer
be used for parameter estimation and hypothesis testing. However, whatever the two
populations are, $\bar{X}$ is still an unbiased estimate of $\mu_X$, and $\bar{Y}$ is still
an unbiased estimate of $\mu_Y$. Intuitively, if $\mu_X = \mu_Y$, the absolute difference
between the two sample means, i.e., $\Delta_{obs} = |\bar{X} - \bar{Y}|$, should not be too
large, regardless of the distributions of the two populations. Therefore, if a number
of values of the absolute difference (which is the test statistic here) can be
obtained under the null hypothesis $H_0: \mu_X = \mu_Y$, an empirical distribution of
the test statistic can be acquired, from which the significance probability can be
calculated. This is the basic idea behind the permutation test.

For the two groups of samples, the specific method is to mix up the 2n samples,
treat them as samples randomly drawn from one population, and then randomly
select n of them as the samples from distribution X and the remaining n as the
samples from distribution Y. Through such a randomization procedure, the two groups
of regenerated samples can be treated as coming from the same population. In
other words, the null hypothesis is true for the two groups of permuted samples.
The test statistic $\Delta = |\bar{X} - \bar{Y}|$ can be calculated for the two permuted
groups of samples, so one data point is obtained for each permutation. The
randomization procedure can be repeated 500 times or more so that an empirical
distribution of the test statistic $\Delta = |\bar{X} - \bar{Y}|$ is obtained. By arranging
the values of the test statistic from the smallest to the largest, an empirical
cumulative distribution of the test statistic is acquired. Given the significance
level $\alpha$, the critical value of the test statistic is equal to the quantile at
probability $1 - \alpha$ on the empirical cumulative distribution. If wanted, the
significance probability of the original sample statistic $\Delta_{obs}$ can also be
calculated from the empirical cumulative distribution, and the statistical inference
can be made by comparing the significance probability with the given significance level.

Table 4.18 gives two groups of samples, each of size 12. The two sample
means are equal to 120.083 and 107.417, and the absolute difference between them
is about 12.67. Figure 4.18 shows the values of the test statistic from 100
permutations. For clarity, figure 4.18A is arranged in the order in which the
permutations were conducted. The dotted line in the figure indicates the absolute
difference in the original samples, i.e., 12.67. Figure 4.18B is arranged in ascending
order of the test statistic. The dotted line in the figure indicates the position of
the 95th percentile, i.e., 15.83. Among the 100 permutations, the test statistic
$\Delta = |\bar{X} - \bar{Y}|$ exceeds the observed value $\Delta_{obs} = 12.67$ 16 times
(figure 4.18A). Therefore, the significance probability of $\Delta_{obs}$ can be estimated
as 0.16. Given the significance level of 0.05, the null hypothesis cannot be rejected;
that is to say, the two distributions from which the two groups of samples were drawn
are not significantly different in population means.


TAB. 4.18 - Observed values in two groups of random samples from two distributions.


Sample   1     2     3     4     5     6     7     8     9     10    11    12    Sample mean
X        134   146   104   119   124   161   108   83    113   129   97    123   120.083
Y        70    118   101   85    107   132   94    135   99    117   126   105   107.417


In fact, the sorted data points represent the empirical cumulative distribution
obtained from the permutation tests, from which the various quantiles can be easily
identified. For example, the 95th percentile is equal to 15.83 in figure 4.18B, which
can be regarded as the critical value of the test statistic at a significance level of 0.05.
The difference between the two sample means given in table 4.18 is equal to 12.67,
lower than the critical value at 0.05, so the null hypothesis cannot be rejected. As
another example, the difference in sample means corresponding to the 99th percentile
is equal to 27.83, which can be regarded as the critical value of the test statistic at
a significance level of 0.01.

FIG. 4.18 - Permutation test on the difference between two population means. Notes:
(A) Scatter plot of the absolute difference between the two sample means from 100
permutations. The dotted line indicates the difference between the two groups of original samples.
(B) Scatter plot of the absolute difference between the two sample means from 100
permutations, rearranged in ascending order. The dotted line indicates the
95th percentile obtained from the 100 permutations.
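The two-sample permutation procedure described above can be sketched in Python as follows, using the values of table 4.18. This is a minimal illustration, not the code behind figure 4.18: the function name `permutation_test`, the simple order-statistic quantile rule, and the choice of counting permuted statistics that are at least as large as the observed one are assumptions of this sketch. Because the permutations are random, the critical value and significance probability will differ somewhat from the 15.83 and 0.16 reported for the 100 permutations in the text.

```python
import math
import random
from statistics import mean

# Values from table 4.18. The first Y value is taken as 70, the value
# consistent with the reported sample mean of 107.417.
X = [134, 146, 104, 119, 124, 161, 108, 83, 113, 129, 97, 123]
Y = [70, 118, 101, 85, 107, 132, 94, 135, 99, 117, 126, 105]


def permutation_test(x, y, n_perm=1000, alpha=0.05, seed=None):
    """Return (observed statistic, critical value, significance probability)."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    n = len(x)
    null_stats = []
    for _ in range(n_perm):
        # Mix all samples as if they came from one population, then split.
        rng.shuffle(pooled)
        null_stats.append(abs(mean(pooled[:n]) - mean(pooled[n:])))
    null_stats.sort()  # empirical cumulative distribution
    # Order statistic at probability 1 - alpha (one simple quantile rule).
    idx = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    critical = null_stats[idx]
    # Proportion of permuted statistics at least as large as the observed one.
    p_value = sum(s >= observed for s in null_stats) / n_perm
    return observed, critical, p_value


if __name__ == "__main__":
    obs, crit, p = permutation_test(X, Y, n_perm=1000, seed=1)
    print(f"observed = {obs:.2f}, critical (0.05) = {crit:.2f}, P = {p:.3f}")
```

Both decision rules can be read from the returned values: reject the null hypothesis if the significance probability is below the chosen level, or equivalently if the observed statistic exceeds the critical value.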


From the previous examples on the calculation of critical values at significance levels
0.05 and 0.01, it can be seen that a large number of permutations, e.g., 500 or
1000, are needed to ensure an accurate empirical distribution and accurate critical
values of the test statistic. Once the empirical distribution is acquired from a large
number of permutations, the critical values can be determined from its quantiles, or
the significance probability of the observed test statistic can be calculated.

Similar to the parametric hypothesis tests, two approaches can be used here to
test whether the two population means are equal or not. Firstly, acceptance or
rejection of the null hypothesis can be based on the significance probability P of the
sample statistic. If $P < \alpha$, the null hypothesis is rejected; otherwise the null
hypothesis is accepted. Secondly, the decision can be based on the critical value
$\Delta_\alpha$ at significance level $\alpha$. If $\Delta_{obs} > \Delta_\alpha$, the null
hypothesis is rejected; otherwise it is accepted. The two approaches are equivalent in
statistical inference. In modern statistics, the first one, i.e., comparing the
significance probability with a given significance level, is more frequently adopted.

In QTL mapping, the null hypothesis to be tested represents the scenario in which no
QTLs are present in the mapping population. Shuffling the relationship between
phenotype and genotype mimics the scenario in which the null hypothesis is
true. Figure 4.19A shows the genome-wide maximum LOD scores from 1000
permutations in the barley DH population. From the sorted LOD scores, the
95th percentile is found to be 2.72. Therefore, 2.72 can be used as the
critical value of the LOD score at the genome-wide level $\alpha_g = 0.05$. With this
threshold, there is a 95% chance of making the correct inference when no QTLs are
present in the population, i.e., the probability of making a Type I error is controlled
at about 5%. The total length of the seven chromosomes is about 1274 cM, as estimated
from the linkage maps in the barley population, with an average marker spacing of about
10 cM. Table 4.16 (marker density 5 cM) shows that, when $\alpha_g = 0.05$ and for DH
populations, the critical LOD score is equal to 2.63 for a genome length of 1000 cM
and 2.79 for a genome length of 1500 cM. Therefore, the critical value obtained by
permutation tests is close to that obtained by simulation.

FIG. 4.19 - Genome-wide maximum LOD scores obtained from 1000 permutations
in one barley DH population (A) and one F2 population (B).


Figure 4.19B shows the genome-wide maximum LOD scores from 1000
permutations in an F2 population. From the sorted LOD scores, the 95th
percentile is found to be 3.78, which is the critical value of the LOD
score at the genome-wide level $\alpha_g = 0.05$. Using this critical value, there is a 95%
chance of making the correct inference when no QTLs are present in the population,
i.e., the probability of making a Type I error is controlled at about 5%. The total
length of the four chromosomes is about 396.09 cM, with an average marker spacing of
about 5.66 cM. Table 4.16 shows that, when $\alpha_g = 0.05$ and for F2 populations,
the critical LOD score is equal to 2.94 for a genome length of 300 cM and
3.16 for a genome length of 500 cM, both lower than the critical value of 3.78
obtained by the permutation test. The difference may be caused by the non-normal
distribution of the phenotypic trait in this population. In the simulation approach,
normally distributed phenotypic values are assured, whereas shuffling the
relationship between genotype and phenotype in the permutation test does not
change the phenotypic distribution. In this sense, the empirical distribution acquired
from simulation studies may be more suitable for the choice of critical values in QTL
mapping. As an alternative, the phenotypic values used in permutation tests can
themselves be simulated from a normal distribution and then used in QTL mapping to
acquire the sampling data points of the genome-wide maximum LOD score.

In genome-wide scanning for the presence of QTL, the LRT statistic at a single
scanning position follows the $\chi^2$ distribution with known degrees of freedom. Since
hypothesis testing is performed throughout the whole genome in QTL mapping,
the Type I error can accumulate to a high level if not controlled properly. Therefore, in
practice, researchers are more concerned with the probability of making a Type I
error after all tests across the whole genome are completed, i.e., the genome-wide
probability. Due to genetic linkage, the tests on one chromosome are not independent,
and it is difficult to derive the theoretical relationship between the genome-wide
significance level and the critical value that should be used in individual tests. The
two empirical methods introduced in this section can be used to determine genome-wide
LOD thresholds that control the genome-wide Type I error.

One method is to establish an empirical relationship for the number of independent
tests in whole-genome testing through systematic simulation studies.
The number of independent tests is mainly affected by genome size,
marker density, and the type of mapping population. Given the number of
independent tests, the Bonferroni adjustment can be used to determine the critical
LOD score for genome-wide testing. In a practical QTL mapping study, the
appropriate LOD score threshold can be chosen from tables 4.15-4.17 according to the
type of population, genome size, average marker density, and genome-wide significance
level.

The other method is to generate a large number of sampling data points of the
maximum LOD score under the null hypothesis by permutation tests, i.e., an
empirical distribution of the maximum LOD score. The maximum LOD scores obtained
by permutation tests are sorted from the smallest to the largest, and the value at a
specific quantile is used as the LOD score threshold in genome-wide testing. For
example, the 95th percentile serves as the LOD threshold at the
genome-wide significance level of 0.05, and the 99th percentile serves as the threshold
at the genome-wide significance level of 0.01. LOD thresholds obtained by permutation
tests are population-specific and may be better adapted to the population used in
QTL mapping. However, the permutation test is more time-consuming. In each
permutation, the phenotypic values are randomized and the entire procedure of
QTL mapping has to be repeated to acquire one sampling data point of the maximum
LOD score.
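The permutation procedure for a genome-wide threshold can be sketched in Python as follows. To keep the example short, the genome scan is replaced by a single-marker test at each marker rather than interval mapping, so this is a simplified stand-in for the mapping step, not the procedure used for figure 4.19. The function names, the 0/2 genotype coding without missing values, and the simulated data are all assumptions of the sketch. The LOD score is computed from the residual sums of squares under the null and alternative models, using LRT = n ln(RSS0/RSS1) for normal errors and LOD = LRT / (2 ln 10).

```python
import math
import random


def marker_lod(genotypes, phenotypes):
    """LOD score of a single-marker test (missing genotypes removed beforehand)."""
    n = len(phenotypes)
    grand = sum(phenotypes) / n
    rss0 = sum((y - grand) ** 2 for y in phenotypes)  # null: one common mean
    rss1 = 0.0
    for code in set(genotypes):
        group = [y for g, y in zip(genotypes, phenotypes) if g == code]
        m = sum(group) / len(group)
        rss1 += sum((y - m) ** 2 for y in group)  # one mean per marker type
    return 0.5 * n * math.log10(rss0 / rss1)


def max_lod(markers, phenotypes):
    """Genome-wide maximum LOD score over all markers."""
    return max(marker_lod(m, phenotypes) for m in markers)


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha_g=0.05, seed=None):
    """Empirical LOD threshold at genome-wide level alpha_g."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype relationship
        maxima.append(max_lod(markers, shuffled))  # repeat the whole scan
    maxima.sort()
    idx = min(n_perm - 1, math.ceil((1.0 - alpha_g) * n_perm) - 1)
    return maxima[idx]


if __name__ == "__main__":
    rng = random.Random(2024)
    n_lines, n_markers = 100, 30
    # Hypothetical unlinked markers coded 0/2 and a trait with no QTL.
    markers = [[rng.choice((0, 2)) for _ in range(n_lines)] for _ in range(n_markers)]
    trait = [rng.gauss(40.0, 2.0) for _ in range(n_lines)]
    print(permutation_threshold(markers, trait, n_perm=200, seed=1))
```

The cost noted in the text is visible in the loop: every permutation repeats the complete scan, so the running time grows with the number of permutations times the cost of one mapping run.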


Exercises 


4.1 Given below are the kernel weights (mg) of 145 lines in the barley DH population,
grouped by their marker types at locus Act8A, where 0 represents the genotype of
parent TR306, 2 represents the genotype of parent Harrington, and -1 represents
the missing genotype.


Marker Kernel weights (mg) of lines in the barley DH population 

Type 0 41.02 41.50 44.10 42.09 42.52 43.98 40.68 41.02 44.08 45.04 
41.33 41.14 48.46 42.45 39.55 39.05 45.99 41.21 46.63 42.46 
40.62 43.03 43.30 43.52 41.57 42.39 42.37 46.48 41.44 38.33 
42.17 39.17 42.77 41.19 39.40 42.54 41.11 43.86 39.32 40.67 
40.59 41.86 41.82 41.37 44.36 46.56 43.82 42.12 40.70 41.50 
41.28 41.29 46.18 42.79 43.02 43.19 42.55 41.85 36.46 43.72 
42.67 42.42 43.22 46.04 41.65 40.30 39.71 41.43 42.11 40.63 

Type 2 40.42 40.29 45.75 40.20 45.66 44.07 40.88 41.58 47.28 44.40 
42.20 39.77 47.20 44.20 43.30 40.78 42.22 40.09 46.08 40.94 
46.11 44.16 42.59 39.16 37.52 45.45 39.67 42.16 43.58 46.62 
42.35 43.11 44.69 43.68 42.32 43.07 38.77 42.70 41.85 43.88 
43.33 44.39 42.76 40.79 41.91 42.00 45.31 42.93 39.92 44.40 
46.45 43.21 45.47 39.49 45.53 40.75 42.38 42.79 39.55 44.85 
43.16 42.50 41.22 42.52 40.91 44.99 37.69 42.85 45.20 43.76 
46.61 43.47 39.14 43.75 

Type -1 40.11 




(1) For each of the two marker groups (i.e., 0 and 2), classify the DH lines into 10 
groups, starting from 36 mg and with a grouping distance of 1.5 mg. Draw the 
frequency distributions on kernel weight for the two marker groups. 

(2) Calculate the sample means and sample variances on kernel weight for the two 
marker groups at locus Act8A. 

(3) Assume the two marker groups have equal variance. Use single marker analysis 
to conduct the significance test for the two marker groups at locus Act8A. 


4.2 Given below are the phenotypic values of 110 individuals in an F2 population,
grouped by their marker types at locus M1-8, where A and B represent the two
parental genotypes, H represents the heterozygous genotype, and X represents the
missing genotype.


Marker Phenotypic values of individuals in an F2 population 

Type A 19.75 19.39 20.40 19.35 19.84 20.27 20.60 20.29 21.00 20.38 
23.23 20.17 20.19 17.13 21.04 21.22 20.60 23.21 20.97 22.31 
18.95 21.34 20.13 21.63 22.22 20.19 19.88 18.94 

Type B 18.06 18.37 16.29 18.74 20.38 19.41 17.57 18.31 16.14 18.88 
17.42 17.44 20.06 18.16 19.87 19.29 20.41 20.10 19.22 19.63 
18.53 17.07 20.82 18.16 19.00 19.34 16.51 17.70 17.53 

Type H 18.19 19.69 20.71 16.99 21.75 22.06 19.83 20.32 19.79 21.39 
21.46 20.73 19.92 19.96 17.72 22.41 20.75 19.50 18.08 19.50 
17.10 18.22 18.93 20.36 20.68 21.19 21.17 21.40 19.99 23.19 
19.46 21.71 18.01 17.63 19.78 20.66 20.27 18.76 17.23 21.22 
19.49 17.47 18.53 19.38 17.77 18.55 18.90 20.89 21.09 20.69 

Type X 19.33 19.67 20.50 


(1) Calculate the sample means and sample variances for the three marker groups 
at locus M1-8. 

(2) Use single marker analysis to conduct the significance test for both additive and 
dominant effects at locus M1-8. 

(3) Use ANOVA to conduct the significance test on the association between locus 
M1-8 and the phenotypic trait. 


4.3 Given below are the genotypes of 90 individuals in a soybean F2 population at
markers Satt339 and Sat_033 (represented by M1 and M2 in the table) and the
chlorophyll content (%) as the phenotypic trait, where 2 and 0 represent the
genotypes of the two parents, 1 represents the heterozygous genotype, and -1 represents
the missing genotype.

Individual 1-30 Individual 31-60 Individual 61-90 
M1 M2 Phenotype M1 M2 Phenotype M1 M2 Phenotype 
2 1 1 2 0 14 1 1 41 
0 1 36 0 2 37 1 2 42 
2 1 18 0 1 37 1 1 34 
1 1 36 0 1 34 0 0 30 
1 0 32 1 1 32 2 1 14 
2 0 6 1 2 33 0 1 37 
0 1 14 0 1 38 1 0 41 
0 2 37 1 1 27 0 0 37 
0 1 18 0 0 38 1 0 29 
1 1 22 0 0 35 1 2 32 
2 2 15 2 0 20 1 1 31 
0 0 6 1 1 35 1 1 34 
0 1 31 0 0 33 0 1 43 
1 2 39 1 2 35 2 2 8 
1 2 37 1 2 40 2 1 8 
1 1 28 0 0 32 2 2 12 
0 1 40 0 -1 31 2 2 6 
1 0 31 2 1 20 0 1 41 
0 0 40 1 0 33 2 1 19 
2 -1 8 1 1 35 1 2 36 
0 1 34 1 2 40 1 1 36 
2 1 34 1 2 40 0 1 38 
1 0 10 1 2 21 2 2 25 
1 2 31 0 0 23 2 1 18 
1 2 33 1 1 33 -1 -1 21 
1 2 33 0 1 34 1 2 40 
1 0 28 1 0 34 0 0 34 
2 2 16 2 1 26 1 2 30 
0 2 31 1 1 32 2 0 18 
1 0 16 1 1 34 1 1 27 


(1) For each marker, classify the individuals into a number of groups starting from 0
and with a grouping distance of 10. Draw the frequency distributions of the
phenotypic trait for the three marker groups.

(2) For each marker, calculate the sample means and sample variances of the
phenotypic trait for the three marker groups.

(3) For each marker, use single marker analysis to conduct the significance test for
both additive and dominant effects.

(4) For each marker, use the F statistic from ANOVA to conduct the significance
test on the association between the marker and the phenotypic trait.


4.4 Two markers A and B are located at 15 and 30 cM on one chromosome, and one 
QTL is located at 20 cM on the same chromosome. Genotypes of two homozygous 
parents are represented by AAQQBB and aaqqbb. Phenotypic values of individuals 
with genotype QQ follow the normal distribution with a mean of 20 and a variance 
of 10; individuals with genotype qq follow the normal distribution with a mean of 15 
and a variance of 10. 


(1) Use the Haldane mapping function to calculate the recombination frequencies 
between locus A and locus Q, between locus Q and locus B, and between locus 
A and locus B. 

(2) Calculate the genetic variance and phenotypic variance on the phenotypic trait 
in the DH population, the expected proportions of individuals with the two 
QTL genotypes in each of the four marker groups, and the phenotypic means of 
the four marker groups. 


4.5 Two markers A and B are located at 15 and 30 cM on one chromosome, and one 
QTL is located at 20 cM on the same chromosome. Genotypes of two homozygous 
parents are represented by AAQQBB and aaqqbb. Phenotypic values of individuals 
with genotype QQ follow the normal distribution with a mean of 20 and a variance 
of 10; individuals with genotype Qq follow the normal distribution with a mean of 18 
and a variance of 10; individuals with genotype qq follow the normal distribution 
with a mean of 15 and a variance of 10. 


(1) Calculate the genetic variance and phenotypic variance of the phenotypic trait
in the F2 population.

(2) Calculate the expected proportions of individuals with the three QTL
genotypes in each of the nine marker groups in the F2 population.

(3) Calculate the phenotypic means of the nine marker groups in the F2 population.


4.6 Use the single marker analysis method to conduct QTL mapping in one DH or 
RIL population provided in the QTL IciMapping software. 


(1) Draw the histogram graphs of the LOD score and additive effect on markers. 
(2) Work out the markers that are significantly associated with the phenotypic trait 
and the proportion of phenotypic variance explained by each marker. 


4.7 Use the simple interval mapping method to conduct QTL mapping in one DH or 
RIL population provided in the QTL IciMapping software. 


(1) Draw the genome-wide LOD score profile and additive effect profile. 

(2) Make a table to demonstrate the QTL mapping results, including QTL posi- 
tions on chromosome, most closely linked markers on both sides of the identified 
QTL, genetic effects of the QTL, and the confidence interval of QTL position. 


4.8 Use the simple interval mapping method to conduct QTL mapping in one F2
population provided in the QTL IciMapping software.


(1) Draw the whole-genome LOD score profile and the additive and dominant effect
profiles.

(2) Make a table to demonstrate the QTL mapping results, including QTL positions
on the chromosome, most closely linked markers on both sides of the identified
QTL, genetic effects of the QTL, and the confidence interval of QTL position.


4.9 Assume the two groups of samples given in table 4.18 come from two normal 
distributions with equal variance. 


(1) Use the t statistic to test whether the two normal populations have equal means.
(2) Use the LRT statistic to test whether the two normal populations have equal
means.


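4.10 Calculation of the mean and variance of a normal mixture distribution
X. Assuming that $k$ normal populations $N(\mu_i, \sigma_i^2)$ ($i = 1, 2, \ldots, k$) are
mixed in given proportions $p_i$ ($i = 1, 2, \ldots, k$), the probability density of the
mixture distribution X is given by,


$$f(x \mid \mu_1, \ldots, \mu_k, \sigma_1^2, \ldots, \sigma_k^2) = \sum_{i=1}^{k} p_i f(x \mid \mu_i, \sigma_i^2)$$


According to the definition of the population mean in statistics, the mean of the
mixture distribution X can be written as the weighted average of the means of the
individual populations as follows,


$$E(X) = \int x \sum_{i=1}^{k} p_i f(x \mid \mu_i, \sigma_i^2)\,dx = \sum_{i=1}^{k} p_i \int x f(x \mid \mu_i, \sigma_i^2)\,dx = \sum_{i=1}^{k} p_i \mu_i, \text{ represented as } \bar{\mu}$$


In addition,

$$E(X^2) = \int x^2 \sum_{i=1}^{k} p_i f(x \mid \mu_i, \sigma_i^2)\,dx = \sum_{i=1}^{k} p_i (\sigma_i^2 + \mu_i^2) = \sum_{i=1}^{k} p_i \sigma_i^2 + \sum_{i=1}^{k} p_i \mu_i^2$$


$$V(X) = E(X^2) - E^2(X) = \sum_{i=1}^{k} p_i \sigma_i^2 + \sum_{i=1}^{k} p_i \mu_i^2 - \bar{\mu}^2$$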

It can be seen that the variance of mixture distribution X consists of two parts. 
The first part is equal to the weighted average of variances of the component dis- 
tributions, and the second part is equal to the variance due to the difference in 
means of the component distributions. 


(1) Based on exercise 4.4, calculate the variance of each of the four marker groups in
the DH population, based on the distributions of the two QTL genotypes.

(2) Based on exercise 4.5, calculate the variance of each of the nine marker groups
in the F2 population, based on the distributions of the three QTL genotypes.
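The mixture mean and variance formulas of exercise 4.10 can be checked numerically with a short Python sketch. The function name and the example proportions, means, and variances are hypothetical and are deliberately different from those of exercises 4.4 and 4.5. The between-component part is written here as $\sum p_i (\mu_i - \bar{\mu})^2$, which equals $\sum p_i \mu_i^2 - \bar{\mu}^2$ when the proportions sum to one.

```python
def mixture_mean_var(props, means, variances):
    """Mean and variance of a mixture with given proportions and components."""
    if abs(sum(props) - 1.0) > 1e-9:
        raise ValueError("mixing proportions must sum to one")
    # Weighted average of the component means.
    mu_bar = sum(p * m for p, m in zip(props, means))
    # First part: weighted average of the component variances.
    within = sum(p * v for p, v in zip(props, variances))
    # Second part: variance due to differences among component means.
    between = sum(p * (m - mu_bar) ** 2 for p, m in zip(props, means))
    return mu_bar, within + between


if __name__ == "__main__":
    # Hypothetical two-component mixture.
    mu_bar, var = mixture_mean_var([0.3, 0.7], [10.0, 14.0], [4.0, 9.0])
    print(mu_bar, var)
```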


Chapter 5 


Inclusive Composite Interval Mapping 


Roughly speaking, research on QTL mapping methodology has gone through three 
stages (Wang, 2009; Wang et al., 2006; Lynch and Walsh, 1998). The first one is 
single-marker or single-point analysis (SMA) (§4.1, chapter 4), which tests the 
existence of QTL by comparing the difference between phenotypic means from dif- 
ferent genotypes at one single marker locus. In SMA, both recombination frequency 
and effect of the QTL are confounded and therefore the effects of QTL cannot be 
properly estimated. Accurate mapping results can only be acquired when positions of 
QTL and marker are completely overlapped, and each chromosome contains at most 
one QTL. The second one is simple interval mapping (SIM, or IM in short) (§4.2, 
chapter 4), i.e., without any control over background genetic variation. One major 
assumption with IM is that each chromosome contains at most one QTL. Misleading 
results may occur when the situation in the actual mapping population does not 
conform to this assumption. For example, when there are two QTLs linked on 
one chromosome with genetic effects in opposite directions, neither of them may 
be detected. If two linked QTLs act in the same direction, a “ghost” QTL may be 
present in the middle of the two true QTLs (see §4.2.6, chapter 4). In addition, 
confidence intervals of the detected QTLs are wide; deviations of the estimated 
positions and effects are large. The third stage is interval mapping with background 
control. By introducing other markers outside the scanning interval as covariates, the 
influence of QTLs outside the current mapping interval can be controlled, and the 
phenomenon of “ghost” QTL can be avoided to a great extent, making the approach suitable for
the scenario in which multiple QTLs are present on the same chromosome.

Composite interval mapping (CIM), proposed by Zeng (1994), is a QTL mapping
method with background control. CIM combines IM with multiple marker regression
analysis, aiming to remove the effects of QTLs in other intervals or on other
chromosomes from the test of the QTL at the current position. However, in the
implemented algorithm of CIM, the QTL effect at the current testing position and the
regression coefficients of the marker variables used to control the genetic background
are estimated simultaneously. The coefficient of the same marker variable may therefore
take different estimated values as the testing position moves along the chromosome.
The algorithm of CIM cannot completely ensure that the effect of the QTL in the current
testing interval is not absorbed by the background marker variables, which may result
in biased estimates of QTL position and effect; for examples, see table 4 and figure 1
in Zeng (1994). In addition, different methods of selecting the background marker
variables may give rather different mapping results, and it is not clear which method
should be preferred. Furthermore, CIM is difficult to extend to the mapping of
epistatic QTLs (Wang, 2009; Li et al., 2007; Zeng et al., 1999; Kao et al., 1999).

To address these issues associated with CIM, we proposed inclusive composite 
interval mapping (ICIM), which consists of two steps. In the first step, information on 
all markers is used to select important marker variables through stepwise regression, 
and in the meantime, the effects of the selected marker variables are estimated. In the 
second step, the linear model obtained by stepwise regression is used to adjust the 
original phenotypic values. Based on the adjusted values, additive (and dominant) 
QTLs are identified through one-dimensional scanning, and epistatic QTLs are 
identified through two-dimensional scanning. This mapping strategy simplifies the 
process of controlling the background genetic variation and improves the power of 
QTL detection (Li et al., 2007, 2010, 2012; Wang, 2009; Zhang et al., 2008). This 
chapter introduces ICIM for additive (and dominant) QTL, and the next chapter will 
introduce ICIM for epistatic QTLs and QTL by environment interaction. 


5.1 Importance of the Control on Background Genetic 
Variation in QTL Mapping 


In SMA and IM, there is no control over the background genetic variation,
which, together with random errors, makes up the sampling variance in parameter
estimation and hypothesis testing. Statistically, the accuracy of parameter estimation
and the power of hypothesis testing are affected, on the one hand, by the true values
of the distribution parameters and, on the other hand, by the variance of the population
(or distribution) from which the samples are drawn. The larger the sampling variance, the
lower the accuracy of parameter estimation and the lower the power of hypothesis
testing. Therefore, the QTL detection power of SMA and SIM is not high.

As far as the estimation of the population mean is concerned, in most cases, the 
sample mean is the best estimate of the population mean. For a normal distribution 
$N(\mu, \sigma^2)$ and a set of n random samples, the sample mean follows the distribution 
given in equation 5.1. Therefore, the smaller the population variance $\sigma^2$, the 
smaller the variance of the sample mean, and the closer the sample mean is to the 
population mean $\mu$. When the population variance is smaller, the population mean can 
be more accurately estimated by the sample mean. At the same time, the statistical 
power of hypothesis testing related to the estimated parameters would be higher. 


$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \tag{5.1}$$


Assume one trait is controlled by two genetically independent QTLs (represented 
by Q1 and Q2) in one DH population (table 5.1), and additive effects of Q1 and Q2 are 
equal to 2 and 1.5, respectively. In the absence of segregation distortion, 
the genetic variance of one QTL in the DH population is equal to the square of its 
additive effect, and total genetic variance is the sum of genetic variances from the two 
QTLs. Assuming that the error variance on the trait is equal to 1, the broad-sense 
heritability of the trait in the DH population is equal to 0.86 (table 5.1). If two 
genotypes QQ and qq at Q1 can be distinguished, mean value of QQ is equal to the 
population mean plus the additive effect, i.e., 20 + 2 = 22; and mean value of qq is 
equal to the population mean minus the additive effect, i.e., 20 − 2 = 18. Variance 
within either genotype is composed of two parts, i.e., variance due to the segregation 
at Q2 (i.e., 2.25), and random error variance (i.e., 1.0). Therefore, the variance of the 
sub-population of either genotype at Q1 is equal to 3.25 (figure 5.1A). Similarly, the 
mean values of two genotypes QQ and qq at Q2 are equal to 21.5 and 18.5, respectively. 
Variance within either genotype at Q2 is also composed of two parts, i.e., the variance 
caused by the segregation at Q1 (i.e., 4), and random error variance (i.e., 1.0). 
Therefore, the variance of the sub-population of either genotype at Q2 is equal to 5 
(figure 5.1B). 


TAB. 5.1 — Genetic parameters on one trait controlled by two QTLs in one DH population. 


Population   Additive effect   Genetic variance   Total genetic   Error      Broad-sense 
mean         Q1      Q2        Q1      Q2         variance        variance   heritability 
20           2       1.5       4       2.25       6.25            1          0.86 


Suppose that the background genetic variation can be controlled. For example, 
when studying the distributions of genotypes at Q1, genetic variance caused by the 
segregation at Q2 is controlled, so that the variation within each genotype at Q1 is 
only caused by random error. Mean values of two genotypes at Q1 do not change. 
After controlling the background genetic variation, mean values of the two 
genotypes at Q1 are still equal to 22 and 18, but the variance is decreased from 3.25 to 1, 
and the difference between the two genotypic means becomes 4 times the standard 
deviation (figure 5.1C). When there is no control, the absolute difference between 
two genotypic means is still 4, but the difference between the two means is only 2.22 
times the standard deviation (i.e., $4/\sqrt{3.25}$). For Q2, if genetic variance caused by 
the segregation at Q1 can be controlled, mean values of the two genotypes at Q2 are 
still 21.5 and 18.5. However, the variance is decreased from 5 to 1 (figure 5.1D), and 
the difference between two genotypic means is increased from 1.34 times (i.e., $3/\sqrt{5}$) 
the standard deviation to 3 times the standard deviation. 
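
The arithmetic behind table 5.1 and the standardized differences quoted above can be checked with a short calculation. The following is a minimal sketch; the function and variable names are illustrative only and are not taken from any mapping software.

```python
import math

# Values taken from table 5.1 (DH population, two independent QTLs).
a1, a2 = 2.0, 1.5      # additive effects of Q1 and Q2
error_var = 1.0        # random error variance

# Without segregation distortion, each QTL contributes a^2 to the genetic variance.
var_q1, var_q2 = a1 ** 2, a2 ** 2
genetic_var = var_q1 + var_q2
heritability = genetic_var / (genetic_var + error_var)


def standardized_difference(effect, within_var):
    """Difference between the QQ and qq means (2a) in units of the within-genotype SD."""
    return 2.0 * effect / math.sqrt(within_var)


# No background control: the other QTL inflates the within-genotype variance.
q1_no_control = standardized_difference(a1, var_q2 + error_var)  # 4 / sqrt(3.25)
q2_no_control = standardized_difference(a2, var_q1 + error_var)  # 3 / sqrt(5)

# Background control: only the random error variance remains.
q1_control = standardized_difference(a1, error_var)              # 4 / 1
q2_control = standardized_difference(a2, error_var)              # 3 / 1

print(round(heritability, 2), round(q1_no_control, 2), round(q2_no_control, 2))
print(q1_control, q2_control)
```

The first printed line gives the heritability and the two standardized differences without background control; the second gives the standardized differences after the background variance has been removed.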

When performing the significance test on means of two normal populations with 
equal variance, the greater the ratio of the difference between two means to the 
standard deviation of a population, the easier it is to detect the difference between 
two population means, and the higher the statistical power in detection. It can also 
be seen from figure 5.1 that for both Q1, which has a larger effect, and Q2, which has 
a smaller effect, the control on background genetic variation reduces the sampling 
variance within each QTL genotype, making it easier to observe the two genotypes 
(dashed lines in figure 5.1) as components included in the mixture distribution (solid 
line in figure 5.1). The test for the existence of a QTL is also based on the significant 
difference between means of different QTL genotypes. The greater the difference, the 
greater the possibility that the QTL is detected. A minor difference can be easily 
masked by random errors, and thus fails to reach the significance level which is 
required to reject the null hypothesis of no QTL. When different genotypes at 
one locus result in similar phenotypic means on a trait of interest, the difference in 
the genetic constitution will not have a significant effect on the phenotypic trait. The 
chromosomal position to be tested will be concluded to have no QTL, even though 
the DNA sequence at this locus may still be different. 


FIG. 5.1 – Component distributions of two QTL genotypes (dashed lines) and their mixture 
distribution (solid line) without (A and B) and with (C and D) the control of background 
genetic variation. Notes: Assume that there are two genetically independent QTLs in a DH 
population, their additive effects are equal to 2 (A and C) and 1.5 (B and D), the random 
error variance of the trait is equal to 1, and the population mean is equal to 20. A. Without 
background control, mean values of two components of genotypes at Q1 are 18 and 22, and the 
sampling variance is 3.25. B. Without background control, mean values of two components of 
genotypes at Q2 are 18.5 and 21.5, and the sampling variance is 5. C. With background 
control, mean values of two components of genotypes at Q1 are 18 and 22, and the sampling 
variance is 1. D. With background control, mean values of two components of genotypes at Q2 
are 18.5 and 21.5, and the sampling variance is 1. 

For most quantitative traits and genetic populations, there are multiple genetic 
loci segregating at the same time. Individuals with different combinations of alleles 
at different loci have different phenotypic values, and the genetic variance of the 
population comes from the genotypic segregation at all genetic loci that control the 
phenotypic trait. In SMA and IM, both background genetic variation and random 
error variance are included in the sampling variance. In ICIM, by contrast, the background 
genetic variation is controlled, so that the sampling variance only comes from 
random error, which in turn improves the accuracy of parameter estimation and the 
power of QTL detection. 

5.2 Inclusive Composite Interval Mapping in DH 
Populations 


5.2.1 Additive Genetic Model of One Single QTL 


In the additive genetic model at one single QTL (called locus Q, where Q and q 
represent the two alleles at this locus), genotypic values or phenotypic means (G) of the two 
homozygous genotypes QQ and qq at the locus can be expressed as, 


$$G = \mu + a w \tag{5.2}$$


where $\mu$ represents the average value of the two homozygous genotypes QQ and qq, a is 
the additive effect, and w is the indicator variable of QTL genotype, with w = 1 for 
genotype QQ and w = −1 for genotype qq. 

Before the QTL is mapped, the genotype at the QTL is unknown for any DH line 
in the mapping population and the additive effect of the QTL is still to be estimated, 
so equation 5.2 alone is of little help in parameter estimation. However, 
marker genotypes are known, and they provide information on the QTL genotype through 
the linkage relationship between marker and QTL. Therefore, it is necessary to seek 
the relationship between the QTL genotype and marker genotype by taking 
advantage of the constructed linkage map. Assume that the QTL is located between 
two polymorphic marker loci A and B, and the genotypes of the two parents are 
AAQQBB and aaqqbb. The bi-parental DH population has four different genotypes 
at loci A and B, and frequencies of two QTL genotypes under each marker type can 
be derived from the recombination frequencies between the QTL and the two marker loci. In 
fact, the joint frequencies of QTL genotypes in table 4.8, chapter 4, when divided by 
the frequency of the marker type, give the frequencies of two QTL genotypes under 
the corresponding marker type, which are shown in table 5.2. Based on the values of 
w in table 5.2, and the probabilities of QTL genotypes under various marker types, 
the conditional expectation of QTL indicator variable w under various marker types 
can be calculated, i.e., 


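$$
\begin{aligned}
E(w \mid x_L = 1, x_R = 1) &= \sum w \times \Pr\{w \mid x_L = 1, x_R = 1\} = 1 - \frac{2 r_L r_R}{1 - r} = f_1, \\
E(w \mid x_L = 1, x_R = -1) &= \sum w \times \Pr\{w \mid x_L = 1, x_R = -1\} = \frac{r_R - r_L}{r} = f_2, \\
E(w \mid x_L = -1, x_R = 1) &= \sum w \times \Pr\{w \mid x_L = -1, x_R = 1\} = \frac{r_L - r_R}{r} = -f_2, \\
E(w \mid x_L = -1, x_R = -1) &= \sum w \times \Pr\{w \mid x_L = -1, x_R = -1\} = -1 + \frac{2 r_L r_R}{1 - r} = -f_1.
\end{aligned}
$$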

TAB. 5.2 – The expected frequencies of QTL genotypes and conditional expectations of QTL 
indicator variables under four marker genotypes at two adjacent marker loci in the DH 
population. 


Marker     Frequency   Marker variables   Conditional probabilities of QTL genotypes                       Conditional expectation 
genotype               x_L     x_R        QQ (w = 1)                        qq (w = -1)                      E(w | x_L, x_R) 

AABB       (1 - r)/2    1       1         pi_11 = (1-r_L)(1-r_R)/(1-r)      pi_12 = r_L r_R/(1-r)            1 - 2 r_L r_R/(1-r) 
AAbb       r/2          1      -1         pi_21 = (1-r_L) r_R/r             pi_22 = r_L (1-r_R)/r            (r_R - r_L)/r 
aaBB       r/2         -1       1         pi_31 = r_L (1-r_R)/r             pi_32 = (1-r_L) r_R/r            (r_L - r_R)/r 
aabb       (1 - r)/2   -1      -1         pi_41 = r_L r_R/(1-r)             pi_42 = (1-r_L)(1-r_R)/(1-r)     -1 + 2 r_L r_R/(1-r) 


Note: It is assumed that locus Q is located between markers A and B with the recombination frequency denoted by r; r_L 
and r_R are the recombination frequencies between Q and the left marker, and between Q and the right marker, 
respectively. Assume that there is no crossover interference, i.e., r = r_L + r_R - 2 r_L r_R. 


Thus, the conditional expectations of QTL genotypic indicator w can be 
expressed by the three recombination frequencies. Define two other functions of 
recombination frequencies, i.e., $\lambda_L$ and $\lambda_R$ in equation 5.3. It is not difficult to verify 
that the expectation of w can be expressed as a linear function of marker indicators 
$x_L$ and $x_R$ in terms of $\lambda_L$ and $\lambda_R$, i.e., equation 5.4. 


$$
\begin{aligned}
r &= r_L + r_R - 2 r_L r_R, \\
\lambda_L &= \tfrac{1}{2}(f_1 + f_2) = \frac{r - r_L + r_R - 2 r r_R}{2r(1 - r)}, \\
\lambda_R &= \tfrac{1}{2}(f_1 - f_2) = \frac{r + r_L - r_R - 2 r r_L}{2r(1 - r)}
\end{aligned} \tag{5.3}
$$

$$E(w \mid x_L, x_R) = \lambda_L x_L + \lambda_R x_R \tag{5.4}$$


For example, marker A, locus Q, and marker B are located on the same 
chromosome at 10 cM, 16 cM, and 30 cM, respectively. From the conditional frequencies 
of QTL genotypes given in table 4.10, chapter 4, the expected values of the QTL 
genotype indicator under four marker genotypes are calculated to be 0.9835, 0.3978, 
−0.3978, and −0.9835, respectively. Defining $\lambda_L$ = (0.9835 + 0.3978)/2 = 0.6906 
and $\lambda_R$ = (0.9835 − 0.3978)/2 = 0.2928, the linear relationship $E(w \mid x_L, x_R) = 
0.6906 x_L + 0.2928 x_R$ is obtained. 
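
The numbers in this example can be reproduced with a few lines of code. The sketch below assumes that map distances are converted to recombination frequencies with the Haldane mapping function (an assumption made here for illustration; it is consistent with the no-interference condition in table 5.2). The function names are hypothetical.

```python
import math


def haldane_r(distance_cm):
    """Recombination frequency from map distance in cM (Haldane function, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def qtl_indicator_expectations(pos_a, pos_q, pos_b):
    """Return f1, f2, lambda_L and lambda_R for a QTL between markers A and B."""
    r_l = haldane_r(pos_q - pos_a)
    r_r = haldane_r(pos_b - pos_q)
    r = r_l + r_r - 2.0 * r_l * r_r           # no crossover interference
    f1 = 1.0 - 2.0 * r_l * r_r / (1.0 - r)    # E(w | AABB)
    f2 = (r_r - r_l) / r                      # E(w | AAbb)
    return f1, f2, 0.5 * (f1 + f2), 0.5 * (f1 - f2)


f1, f2, lam_l, lam_r = qtl_indicator_expectations(10.0, 16.0, 30.0)

# Conditional expectations under the four marker genotypes, and their
# reconstruction from the linear function lambda_L * x_L + lambda_R * x_R.
expected = {"AABB": f1, "AAbb": f2, "aaBB": -f2, "aabb": -f1}
markers = {"AABB": (1, 1), "AAbb": (1, -1), "aaBB": (-1, 1), "aabb": (-1, -1)}
linear = {g: lam_l * x_l + lam_r * x_r for g, (x_l, x_r) in markers.items()}

for genotype in markers:
    print(genotype, round(expected[genotype], 4), round(linear[genotype], 4))
```

For each marker genotype, the two printed columns agree, which is the numerical counterpart of the statement that the conditional expectation of w is linear in the two marker indicators.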

Based on equation 5.4, the conditional expectation of genotypic value G under 
each marker genotype under the single-locus additive model (equation 5.2) can be 
expressed by the linear model in equation 5.5. Coefficients of marker indicator 
variables in equation 5.5 are treated as the major effects of two flanking markers, 
denoted as $A_L$ and $A_R$, respectively. Thus, the conditional expectation of genotypic 
value is expressed as a linear model of marker major effects in equation 5.6. 


$$E(G \mid x_L, x_R) = \mu + a E(w \mid x_L, x_R) = \mu + a \lambda_L x_L + a \lambda_R x_R \tag{5.5}$$

$$E(G \mid x_L, x_R) = \mu + A_L x_L + A_R x_R, \text{ where } A_L = a \lambda_L \text{ and } A_R = a \lambda_R \tag{5.6}$$


From the previous derivation, it can be seen that coefficients $A_L$ and $A_R$ in 
equation 5.6 contain information not only about the position of QTL but also about 
the additive effect of QTL. If marker coefficients in equation 5.6 can be estimated, 
the relative position of QTL in the marker interval can be inferred, and its additive 
genetic effect can be estimated. 


5.2.2 Additive Genetic Model for Multiple QTLs 


For simplicity, it is assumed that two homozygous parents P1 and P2 differ at 
m QTLs, which are distributed in m intervals flanked by m + 1 markers. The 
parental QTL genotype is assumed to be Q1Q1Q2Q2...QmQm for P1, and q1q1q2q2...qmqm 
for P2. In the bi-parental DH population, $X = (x_1, x_2, \ldots, x_m, x_{m+1})$ represents 
marker indicator variables, taking values 1 and −1 for the two homozygous marker 
types of P1 and P2, respectively. $W = (w_1, w_2, \ldots, w_m)$ represents the QTL indicator 
variables, where the genotype of P1 is indicated by 1, and the genotype of P2 is indicated by 
−1. Additive effects of QTLs are represented by $a_1, a_2, \ldots, a_m$. Assuming that the 
effects of individual QTLs are additive, genotypic value G of one DH line in the 
mapping population can be expressed as, 


$$G = \mu + \sum_{j=1}^{m} a_j w_j \tag{5.7}$$


In QTL mapping, the expectation of QTL genotype indicator $w_j$ in the DH line is 
related to the position of the jth QTL on the chromosome, and the length of the 
marker interval (equations 5.3 and 5.4). To recognize different QTLs, equation 5.4 is 
re-written as equation 5.8, where $\lambda_{j(L)}$ and $\lambda_{j(R)}$ are functions of recombination 
frequencies between the jth marker and the jth QTL, between the jth QTL and the 
(j + 1)th marker, and between the jth and the (j + 1)th markers. Parameters $\lambda_{j(L)}$ and 
$\lambda_{j(R)}$ are calculated by equation 5.3. 


$$E(w_j \mid X) = \lambda_{j(L)} x_j + \lambda_{j(R)} x_{j+1} \tag{5.8}$$


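When marker genotypes are known, the expected genotypic value G can be 
expressed as a linear function of marker variables, i.e., 


$$E(G \mid X) = \mu + \sum_{j=1}^{m} a_j \left(\lambda_{j(L)} x_j + \lambda_{j(R)} x_{j+1}\right) = \beta_0 + \sum_{j=1}^{m+1} \beta_j x_j \tag{5.9}$$


where 

$$
\begin{aligned}
\beta_0 &= \mu, \quad \beta_1 = \lambda_{1(L)} a_1, \\
\beta_j &= \lambda_{(j-1)(R)} a_{j-1} + \lambda_{j(L)} a_j \quad (j = 2, \ldots, m), \text{ and} \\
\beta_{m+1} &= \lambda_{m(R)} a_m.
\end{aligned}
$$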

From the formulas on marker coefficients as given in equation 5.9, it can be seen 
that the coefficient of the jth marker is only affected by the additive effect of the 
QTL in marker intervals (j − 1, j) and (j, j + 1). If there are no QTLs in the left and 
right intervals adjacent to (j, j + 1), regression coefficients $\beta_j$ and $\beta_{j+1}$ in the marker 
interval (j, j + 1) only include the information on the position and additive effect of 
the QTL located in the interval (j, j + 1). This is actually the theoretical basis for 
the identification of additive QTLs in ICIM and other regression-based mapping 
methods. 
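
The local structure of the marker coefficients in equation 5.9 can be illustrated with a small sketch. The QTL effects and the values of $\lambda_{j(L)}$ and $\lambda_{j(R)}$ used below are hypothetical numbers chosen only for illustration, and `marker_coefficients` is not a function from any existing package.

```python
def marker_coefficients(m, qtls):
    """Build beta_1..beta_{m+1} of equation 5.9.

    m    : number of marker intervals (markers are numbered 1..m+1)
    qtls : dict mapping the interval number j (1-based) to a tuple
           (a_j, lambda_jL, lambda_jR); intervals without a QTL are omitted
    Returns a list whose element j-1 is beta_j.
    """
    beta = [0.0] * (m + 1)
    for j, (a, lam_l, lam_r) in qtls.items():
        beta[j - 1] += lam_l * a   # marker j, left flank of interval j
        beta[j] += lam_r * a       # marker j+1, right flank of interval j
    return beta


# Hypothetical example: five intervals, QTLs only in intervals 2 and 4,
# so the two QTLs are separated by an empty interval.
qtls = {2: (2.0, 0.6906, 0.2928), 4: (1.5, 0.30, 0.68)}
beta = marker_coefficients(5, qtls)

for marker, value in enumerate(beta, start=1):
    print(marker, round(value, 4))
```

Markers 2 and 3 carry only the QTL in interval 2, markers 4 and 5 carry only the QTL in interval 4, and markers 1 and 6 have zero coefficients, which is the property used for background control.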

Assume that there are n DH lines in the mapping population, and that the 
phenotypic values of the trait of interest and the genotypic data of m + 1 ordered 
markers are known. Based on the genetic model in equation 5.7, and the linear 
model in equation 5.9, the following linear regression model can be obtained in 
mapping the additive QTLs, 


$$y_i = E(G \mid X) + \varepsilon_i = \beta_0 + \sum_{j=1}^{m+1} \beta_j x_{ij} + \varepsilon_i \tag{5.10}$$


where i = 1, 2, ..., n; n is the DH population size; $y_i$ is the phenotypic value of the 
ith DH line; $\beta_0$ is the constant term; $\beta_j$ is the partial regression coefficient of 
phenotype on the jth marker variable; $x_{ij}$ is the indicator variable of the jth marker in 
the ith DH line, where the parent P1 marker type is represented by 1, and the parent P2 
marker type is represented by −1; and $\varepsilon_i$ is the residual effect, which is assumed 
to follow a normal distribution with mean 0 and variance $\sigma_\varepsilon^2$. From the derivation 
process, it can be seen that under the assumption that the effects of individual QTLs 
satisfy additivity, the regression coefficient of one marker depends only on QTLs 
located in the two intervals adjacent to the marker, not on QTLs located in other 
intervals. By this property, the linear model given in equation 5.10 provides the 
theoretical basis for the ICIM method to achieve background control. 


5.2.3 One-Dimensional Scanning and Hypothesis Testing 
for Additive QTLs 


The basic idea behind ICIM is to use the genome-wide marker information to build a 
linear regression model representing the relationship between genotype and phe- 
notype (i.e., equation 5.10), to control the background genetic variation by 
adjusting the original phenotypic values, and then to perform the interval mapping 
on adjusted phenotypic values. Considering that the number of QTLs is always 
much lower than the number of markers, stepwise regression is used to select the 
most important marker variables. The coefficients of unselected markers in stepwise 
regression are set to 0. Parameters included in equation 5.10 are estimated only 
once. If the current scanning interval is (k, k + 1), the observed phenotypic value is 
adjusted as follows, 


$$\Delta y_i = y_i - \sum_{j \neq k, k+1} b_j x_{ij} \tag{5.11}$$

where $\Delta y_i$ is the adjusted phenotypic value, $x_{ij}$ is the indicator variable of the jth 
marker in the ith DH line, and $b_j$ is the estimate of parameter $\beta_j$ in equation 5.10. 
If the population size is large enough, and there is no QTL in intervals adjacent 
to (k, k + 1), estimates $b_k$ and $b_{k+1}$ only contain information on the position and 
additive effect of the QTL located in the interval (k, k + 1). Therefore, in the 
subsequent interval mapping, the adjusted phenotype $\Delta y_i$ includes the full 
information about the QTL position and additive effect in the current scanning interval. 
At the same time, by introducing the regression coefficients of other markers into 
equation 5.11, effects of QTLs in other intervals and chromosomes are controlled. 
The adjusted phenotypic value $\Delta y_i$ will not change until the scanning position moves 
into the next marker interval. In contrast to IM, ICIM does not make the 
assumption that there is at most one QTL on a single chromosome. The only 
assumption is that of "isolated QTLs". That is to say, between two linked QTLs there is at 
least one empty marker interval (i.e., without QTL); or, the linked QTLs are 
isolated by at least one empty marker interval (Li et al., 2007; Whittaker et al., 1996). 
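
The adjustment in equation 5.11 is a simple subtraction once the marker coefficients are available. The sketch below assumes the coefficients have already been estimated (with 0 for markers not selected by stepwise regression); the data are made-up toy values, and `adjust_phenotypes` is a hypothetical helper, not part of any ICIM software.

```python
def adjust_phenotypes(y, x, b, k):
    """Adjusted phenotypes of equation 5.11 for the scanning interval (k, k+1).

    y : phenotypic values, one per DH line
    x : marker indicators (1 or -1), x[i][j-1] for line i and marker j
    b : estimated marker coefficients, b[j-1] for marker j (0 if not selected)
    k : 1-based index of the left flanking marker of the current interval
    """
    flanking = {k - 1, k}   # zero-based columns of markers k and k+1
    adjusted = []
    for y_i, x_i in zip(y, x):
        background = sum(
            b_j * x_ij
            for col, (b_j, x_ij) in enumerate(zip(b, x_i))
            if col not in flanking
        )
        adjusted.append(y_i - background)
    return adjusted


# Toy data: three DH lines, four markers; markers 2 and 4 were selected.
y = [21.0, 19.5, 20.2]
x = [[1, 1, -1, -1], [-1, -1, 1, 1], [1, -1, -1, 1]]
b = [0.0, 1.2, 0.0, -0.8]

adj_interval_1 = adjust_phenotypes(y, x, b, k=1)   # keeps markers 1 and 2
adj_interval_3 = adjust_phenotypes(y, x, b, k=3)   # keeps markers 3 and 4
print(adj_interval_1)
print(adj_interval_3)
```

When the scan is in interval (1, 2), only the effect attached to marker 4 is removed; when the scan moves to interval (3, 4), only the effect attached to marker 2 is removed. The adjusted values therefore change only when the scanning position crosses into a new marker interval.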

Parameter estimation and hypothesis testing in ICIM are similar to those in IM, 
which have been introduced in detail in chapter 4. The only difference is that the 
phenotypic values are replaced with their adjusted values obtained by equation 5.11. 
As in IM, ICIM detects the presence of QTLs through one-dimensional scanning 
along the whole genome. When scanning at a specific chromosomal position, 
markers other than the two in the current interval are used to adjust the phenotypic 
values. If there is a QTL located at the current scanning position (with two alleles 
denoted by Q and q), distributions of two QTL genotypes QQ and qq are denoted by 
$N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, respectively. Each marker genotype is a mixture 
distribution of QTL genotypes QQ and qq. Proportions of QQ and qq in each marker 
genotype are determined by recombination frequencies between the QTL and two 
flanking markers, and recombination frequency between two flanking markers 
(table 5.2). Whether there is a QTL at the current scanning position can be 
determined by testing the following two hypotheses, 


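$$H_0: \mu_1 = \mu_2 \quad \text{and} \quad H_A: \mu_1 \neq \mu_2 \tag{5.12}$$


Let $S_k$ represent the set of DH lines belonging to the kth marker genotype, where 
k = 1, 2, 3, and 4 correspond to the four marker genotypes in table 5.2. If the marker 
genotype of the ith DH line belongs to the kth genotype, it is denoted as $i \in S_k$ in 
set notation. Therefore, the log-likelihood function under the alternative 
hypothesis $H_A$ can be written as, 


$$\ln L_A = \sum_{k=1}^{4} \sum_{i \in S_k} \ln\left[\pi_{k1} f(\Delta y_i; \mu_1, \sigma^2) + \pi_{k2} f(\Delta y_i; \mu_2, \sigma^2)\right] \tag{5.13}$$


where $\pi_{k1}$ and $\pi_{k2}$ are the conditional probabilities of QTL genotypes QQ and qq 
under the kth marker genotype (table 5.2), and $f(\cdot; \mu_1, \sigma^2)$ and $f(\cdot; \mu_2, \sigma^2)$ represent 
the density functions of normal distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, followed by 
the two QTL genotypes QQ and qq, respectively. 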

The EM algorithm is needed to calculate the maximum likelihood estimates of 
parameters $\mu_1$, $\mu_2$, and $\sigma^2$ in equation 5.13. The estimating procedure is similar to 
IM as introduced in chapter 4, which will not be repeated here. The maximum 
likelihood estimates are represented by $\hat{\mu}_1$, $\hat{\mu}_2$, and $\hat{\sigma}^2$, respectively. Using the 
estimates of two QTL genotypic means, i.e., $\hat{\mu}_1$ and $\hat{\mu}_2$, additive effect a of the QTL can 
be estimated. From the genetic model given in equation 5.2, the relationship 
between the QTL genotypic means and additive effect is, 


$$\mu_1 = \mu + a, \quad \mu_2 = \mu - a \tag{5.14}$$

Therefore, 

$$\mu = \frac{1}{2}(\mu_1 + \mu_2), \quad a = \frac{1}{2}(\mu_1 - \mu_2) \tag{5.15}$$


Substituting $\mu_1$ and $\mu_2$ in equation 5.15 with their estimates $\hat{\mu}_1$ and $\hat{\mu}_2$ gives the 
estimate of additive genetic effect of the QTL. 

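Under the null hypothesis $H_0$, all $\Delta y_i$ follow the same normal distribution 
$N(\mu_0, \sigma_0^2)$, and the log-likelihood function can be written as, 


$$\ln L_0 = \sum_{i=1}^{n} \ln f(\Delta y_i; \mu_0, \sigma_0^2) \tag{5.16}$$


Maximum likelihood estimates of distribution mean and variance in 
equation 5.16 are, 


$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} \Delta y_i, \quad \hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (\Delta y_i - \hat{\mu}_0)^2 \tag{5.17}$$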

Maximum likelihood estimates under the two hypotheses are substituted into 
equations 5.13 and 5.16 to have the two maximum likelihoods, by which the 
likelihood ratio test (LRT) statistic and LOD score are calculated, and the signifi- 
cance test for the null hypothesis is performed accordingly. 
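
The test at one scanning position can be sketched in a few dozen lines. The following is a minimal illustration under stated assumptions, not the implementation used in any particular software: the EM updates are the standard ones for a two-component normal mixture with known mixing proportions and a common variance, the iteration is started from the null-hypothesis estimates so that the likelihood under the alternative cannot fall below the null value, and the LOD score is obtained from the LRT statistic by the usual conversion LOD = LRT/(2 ln 10). The data values and the helper names `conditional_probs` and `icim_lod` are hypothetical.

```python
import math


def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def conditional_probs(r_l, r_r):
    """(pi_k1, pi_k2) for AABB, AAbb, aaBB and aabb, as in table 5.2."""
    r = r_l + r_r - 2.0 * r_l * r_r
    p, q = (1 - r_l) * (1 - r_r) / (1 - r), r_l * r_r / (1 - r)
    s, t = (1 - r_l) * r_r / r, r_l * (1 - r_r) / r
    return [(p, q), (s, t), (t, s), (q, p)]


def icim_lod(groups, pis, max_iter=500, tol=1e-10):
    """groups[k]: adjusted phenotypes of the kth marker genotype."""
    data = [(y, p1, p2) for ys, (p1, p2) in zip(groups, pis) for y in ys]
    n = len(data)

    # Null hypothesis: equations 5.16 and 5.17.
    mu0 = sum(y for y, _, _ in data) / n
    var0 = sum((y - mu0) ** 2 for y, _, _ in data) / n
    ln_l0 = sum(math.log(normal_pdf(y, mu0, var0)) for y, _, _ in data)

    # Alternative hypothesis: EM on equation 5.13, started at the null estimates.
    mu1, mu2, var, ln_la = mu0, mu0, var0, ln_l0
    for _ in range(max_iter):
        # E-step: posterior probability that each line has genotype QQ.
        post = []
        for y, p1, p2 in data:
            d1, d2 = p1 * normal_pdf(y, mu1, var), p2 * normal_pdf(y, mu2, var)
            post.append(d1 / (d1 + d2))
        # M-step: weighted means and pooled variance.
        s1 = sum(post)
        mu1 = sum(w * d[0] for w, d in zip(post, data)) / s1
        mu2 = sum((1 - w) * d[0] for w, d in zip(post, data)) / (n - s1)
        var = sum(w * (d[0] - mu1) ** 2 + (1 - w) * (d[0] - mu2) ** 2
                  for w, d in zip(post, data)) / n
        new_ln_la = sum(math.log(p1 * normal_pdf(y, mu1, var)
                                 + p2 * normal_pdf(y, mu2, var))
                        for y, p1, p2 in data)
        converged = new_ln_la - ln_la < tol
        ln_la = new_ln_la
        if converged:
            break

    lrt = 2.0 * (ln_la - ln_l0)
    lod = lrt / (2.0 * math.log(10.0))
    return lod, 0.5 * (mu1 - mu2), mu1, mu2, var


# Made-up adjusted phenotypes for the four marker genotypes AABB, AAbb, aaBB, aabb.
groups = [
    [21.5, 22.5, 22.0, 21.0, 23.0, 22.2, 21.8],
    [22.1, 18.2],
    [17.9, 21.8],
    [18.5, 17.5, 18.0, 19.0, 17.0, 18.2, 17.8],
]
lod, additive, mu1, mu2, var = icim_lod(groups, conditional_probs(0.0565, 0.1221))
print(round(lod, 2), round(additive, 2))
```

In a full scan, this calculation would be repeated at every scanning position, with the conditional probabilities recomputed from the current position and the adjusted phenotypes recomputed whenever the marker interval changes.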


5.2.4 Application of ICIM in a DH Mapping Population 
in Barley 


One well-known population in QTL mapping studies of barley (Hordeum vulgare L.) is 
composed of 145 DH lines derived from two parents, Harrington and TR306. 
Using genotypic data from 127 polymorphic markers, a linkage map 
covering the seven barley chromosomes (denoted by 1H-7H) was constructed. From 
1992 to 1993, the performance of a number of quantitative traits was evaluated 
under 25 environmental conditions in 17 locations (Tinker et al., 1996). Figure 1.7 in 
chapter 1 shows the genotypic data of 14 markers on chromosome 1H, and table 4.3 
in chapter 4 shows the average kernel weight (KWT) of the 145 DH lines. KWT is 
used here as an example to illustrate the application of ICIM in additive QTL 
mapping. KWT is 38.7 mg in parent Harrington, and 45.0 mg in parent TR306. 
The lowest, average, and highest values of KWT in the DH population are 
36.46 mg, 42.51 mg, and 48.46 mg, respectively. The phenotype of KWT is 
continuously distributed in the DH population, and transgressive segregation is observed in 
both directions (figure 5.2), which is typical of most quantitative traits used in QTL 
mapping. 


[Figure 5.2: histogram of the number of DH lines (vertical axis) against the mid-group 
value of kernel weight in mg, from 36 to 48 (horizontal axis). Annotations: parent TR306 
was coded as 0, and its average kernel weight is 45.0 mg; parent Harrington was coded 
as 2, and its average kernel weight is 38.7 mg.] 


FIG. 5.2 – Frequency distribution of kernel weight in 145 DH lines of barley. 


During the stepwise regression of model selection, probabilities for significant 
marker variables to enter (PIN) and leave (POUT) the regression model 
were PIN = 0.001 and POUT = 0.002, respectively. At each scanning position, the 
adjusted phenotypic values were used in interval mapping. Figure 5.3 shows the 
profiles of LOD score, additive effect, and phenotypic variance explained 
(PVE) from the whole genome scanning. On the LOD score profile, clear peaks are 
observed on chromosomes 2H, 3H, 5H, and 7H, with the highest one located on 
chromosome 5H, and the second highest on chromosome 7H. Additive effects at 
the peak positions are either positive or negative, and roughly speaking, the higher 
the LOD score at the peak, the greater the absolute value of the estimated additive 
effect. PVEs of QTLs at different peaks are positively correlated with LOD scores 
and the absolute values of additive effects. In other words, a QTL with a larger 
genetic effect explains more phenotypic variation, results in a higher LOD score, and 
is therefore easier to detect. 

When the LOD score threshold is set at 2.5, seven significant peaks are identified 
from the LOD score profile in figure 5.3, three on chromosome 2H, one on 3H, one on 
5H, and two on 7H. Thus, seven QTLs on kernel weight are detected by ICIM. 
Positions of the seven significant peaks are viewed as estimates of QTL positions, 
effects at the seven significant peaks are viewed as estimates of additive effects of the 
detected QTLs, and PVEs at the seven significant peaks are viewed as estimates of 
PVEs of the detected QTLs. 

For convenience, the detected QTLs should be given suitable names. It is 
commonly accepted to name a QTL by prefixing the abbreviation of the trait with the 
lowercase letter 'q', followed by an identifier of the chromosome where the QTL is 
located. Similar to the naming convention of genes, the QTL name is written in 
italics when representing one particular allele or gene at the locus. If the 
name only represents the chromosomal locus, the regular format is used in this 
book. If more than one QTL is detected on one chromosome, the QTL names are 
then differentiated by adding sequential numbers. Names of the seven QTLs on 
kernel weight detected in the barley DH population are shown in the first column in 
table 5.3, and the other columns in the table give the location on the chromosome, 
two nearest neighbor markers, LOD score, genetic effect, and other relevant 
information characterizing the detected QTL. 


FIG. 5.3 – Inclusive composite interval mapping for additive QTLs on kernel weight in a 
barley population consisting of 145 DH lines. Notes: In stepwise regression, the probability of 
the variable entering the model (PIN) is 0.001, and the probability of the variable leaving 
the model (POUT) is 0.002. 

qKWT5H located at 5 cM on chromosome 5H and qKWT7H-2 located at 96 cM on 
chromosome 7H are the two QTLs with the largest additive effects, explaining 43.92% 
and 20.53% of the phenotypic variation, respectively (table 5.3). The additive effects 
of the four QTLs having the largest additive effects are negative, indicating that the 
alleles at these QTLs that increase kernel weight come from the parent coded as 0, 
i.e., TR306, which has a higher kernel weight than the other parent (figure 5.2). 
Although the kernel weight of parent Harrington is lower than that of TR306, it still carries 
the alleles that increase kernel weight at three QTLs, i.e., qKWT2H-1, qKWT2H-3, 
and qKWT3H. The fact that both parents have alleles which increase kernel 
weight explains the transgressive segregation observed in figure 5.2. 


TAB. 5.3 – Results from additive QTL mapping on kernel weight in the barley DH mapping 
population by ICIM (PIN = 0.001, POUT = 0.002). 


QTL name    Position on        Nearest marker   Nearest marker   LOD      Additive       PVE (%) 
            chromosome (cM)    on the left      on the right     score    effect (mg) 
qKWT2H-1    84                 Pox              BCD351B          4.10     0.45           [illegible] 
qKWT2H-2    138                ABC620           MWG882           6.36     −0.56          6.21 
qKWT2H-3    190                BCD453B          ABG317           5.45     0.53           5.60 
qKWT3H      27                 Ugp2             Ugp1             3.23     0.38           2.99 
qKWT5H      5                  Act8B            MWG502           31.33    −1.47          43.92 
qKWT7H-1    4                  iPgd1A           BCD129           7.41     −0.59          7.07 
qKWT7H-2    96                 MWG626           VAtp57A          18.35    −1.01          20.53 

The linear regression model in equation 5.10 contains a large number of marker 
variables. However, the number of QTLs is rather limited, and most marker intervals 
do not harbor any QTL. The first step in using ICIM is to determine which markers 
should be included in the linear regression model. In statistics, there are a number of 
variable selection methods, and stepwise regression is one of the most commonly used 
(Broman and Speed, 2002; Stuart et al., 1999; Miller, 1990). There are two 
parameters to be specified in variable selection by stepwise regression. One is the 
significance probability for variables entering the model, denoted by PIN, and 
the other is the significance probability for variables leaving the model, 
denoted by POUT. PIN determines which variables should enter the regression 
model, and POUT determines which variables should leave the regression 
model once a new variable is added to the model. By experience, POUT can be set at 
twice the value of PIN. Obviously, different marker variables may be selected by 
different values of PIN and POUT, and therefore the mapping results may be 
different as well. 

In figure 5.3, PIN = 0.001 and POUT = 0.002. To further understand the 
influence of different parameters on the mapping results from ICIM, figure 5.4 gives 
the LOD score profiles from three probability levels, i.e., PIN = 0.001, 0.01, and 
0.05, and POUT is two times PIN. It can be seen that although PIN varies 
from 0.001 to 0.05, there are still great similarities in the LOD score profiles, indi- 
cating that ICIM is relatively robust to the change in mapping parameters. 


[Figure 5.4: LOD score profiles (vertical axis) from whole genome scanning along the 
seven chromosomes of barley with a step of 1 cM (horizontal axis). Panel labels 
recoverable from the figure: ICIM, PIN = 0.05, R² = 0.81; ICIM, PIN = 0.01, R² = 0.80.] 


FIG. 5.4 – Comparison of LOD score profiles from ICIM under three probability levels in 
stepwise regression, and LOD score profile from IM. Notes: For clarity, 50 is added to the LOD 
score when PIN = 0.001, 100 is added to the LOD score when PIN = 0.01, and 150 is added to 
the LOD score when PIN = 0.05. 

When PIN = 0.001 and POUT = 0.002, the regression model explains 73% of the 
phenotypic variation, which has exceeded the broad-sense heritability of kernel 
weight (i.e., 0.71). Therefore, it can be concluded that additive effects are the major 
source of genetic variation in kernel weight in this population. Inter-genic epistasis is 
less important. See §5.6 for more details on the choice of suitable values of PIN and 
POUT in the ICIM method. 


5.3 Inclusive Composite Interval Mapping 
in F2 Populations 


For simplicity, most QTL mapping methods (here we mean the 
linkage mapping approaches in bi-parental populations derived through 
controlled fertilization rather than association mapping in natural populations) were 
developed using a bi-parental population such as BC, DH, or RIL as an example, where only two 
genotypes are present at each genetic locus. In populations consisting of two distinct 
genotypes at each locus, QTL mapping is focused on the additive genetic effect, even 
though the effect may have different meanings in different populations. Let P1 and 
P2 represent the two homozygous parents. For example, in the BC population where 
P1 is used as the recurrent parent, the additive effect at a specific locus is defined as 
half of the difference between the P1 genotype and the F1 genotype. In a DH or RIL 
population, the additive effect is defined as half of the difference between the P1 
genotype and the P2 genotype. F2 populations have been widely used in genetic 
studies. Each locus has three genotypes in F2 populations, allowing the estimation of 
both additive and dominant effects at the same time. Including more genetic effects 
inevitably makes the QTL mapping procedure more complicated. Zhang et al. 
(2008) reported that dominance can unexpectedly complicate the mapping 
procedure by causing interactions between markers. As a result, the interactions detected 
between markers may be caused by the dominant effect of a QTL, rather than the 
actual epistasis between QTLs. However, F2 populations, which also include the 
immortalized F2 populations composed of a large number of single crosses between 
the bi-parental DHs or RILs (Wang, 2017; Zhang et al., 2022), have to be used when 
studying the genetic basis of heterosis. In this section, we introduce the inclusive 
linear model that includes the interaction variables between two flanking markers, 
and is capable of completely absorbing both additive and dominant effects of QTLs. Based 
on the linear model, we introduce the ICIM algorithm suitable for QTL mapping in 
F2 populations. 


5.3.1 Additive and Dominant Model of One Single QTL 


For one additive and dominant QTL (Q and q are the two alleles at locus Q) in F2 
populations, the genotypic value of an individual with a known genotype at the 
QTL, i.e., QQ, Qq, or qq, is written as, 


$$G = \mu + a w + d v \tag{5.18}$$

where μ is the mean of the two homozygous genotypes QQ and qq, a is the additive 
effect, d is the dominant effect, and w and v are indicators of QTL genotypes valued 
at 1 and 0 for QQ, 0 and 1 for Qq, and −1 and 0 for qq. When performing QTL 
mapping, the QTL genotype is unknown; that is, w and v in equation 5.18 are 
unknown, and genetic parameters a and d are to be estimated. 

Assume that locus Q is located between two co-dominant markers A and B. 
Genotypes of the two parents are AAQQBB and aaqqbb. There are nine classes of 
marker genotypes in the F2 population, and frequencies of the three QTL 
genotypes under each marker class can be determined by recombination 
frequencies between the QTL and two markers (table 5.4). Indicator variables are 
denoted by $x_L$ and $z_L$ for the left marker (i.e., locus A), and $x_R$ and $z_R$ for the 
right marker (i.e., locus B). Values of x and z are similar to those of w and v for 
the QTL, i.e., x = 1 and z = 0 represent the marker genotype of parent P1, 
x = 0 and z = 1 represent the heterozygous marker genotype, and x = −1 and 
z = 0 represent the marker genotype of parent P2. For the genetic model defined 
in equation 5.18, the expectation of genotypic value G under each marker class is 
given in equation 5.19. 


$$E(G \mid x_L, x_R, z_L, z_R) = \mu + a \times E(w \mid x_L, x_R, z_L, z_R) + d \times E(v \mid x_L, x_R, z_L, z_R) \tag{5.19}$$


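Expected values under various marker classes are calculated from equation 5.19 
and are given in the last column of table 5.4. By considering the two flanking 
markers as two genetic loci, the following effects are defined: additive effects of the 
two markers, i.e., $A_L$ and $A_R$; dominant effects of the two markers, i.e., $D_L$ and $D_R$; 
and four interaction effects between the two markers, i.e., AA, AD, DA, and DD. Let 
μ + Δ represent the overall mean of the four homozygous marker genotypes. Therefore, 
the following relationship between the expected values of marker classes and marker 
effects is obtained, 


$$
\begin{pmatrix}
\mu + f_1 a + g_1 d \\
\mu + f_2 a + g_2 d \\
\mu + f_3 a + g_3 d \\
\mu + f_4 a + g_4 d \\
\mu + g_5 d \\
\mu - f_4 a + g_4 d \\
\mu - f_3 a + g_3 d \\
\mu - f_2 a + g_2 d \\
\mu - f_1 a + g_1 d
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & -1 & 0 & 0 & -1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & -1 & 1 & 0 & 0 & 0 & -1 & 0 \\
1 & -1 & 1 & 0 & 0 & -1 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 1 & 0 & -1 & 0 & 0 \\
1 & -1 & -1 & 0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}
\times
\begin{pmatrix}
\mu + \Delta \\ A_L \\ A_R \\ D_L \\ D_R \\ AA \\ AD \\ DA \\ DD
\end{pmatrix}
\tag{5.20}
$$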


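TAB. 5.4 – Expectations of two QTL genotype indicators, and phenotypic mean in each marker class in the F2 population. 

Marker class | Frequency | $x_L$ | $x_R$ | $z_L$ | $z_R$ | $E(w \mid x_L, x_R, z_L, z_R)$ | $E(v \mid x_L, x_R, z_L, z_R)$ | Phenotypic mean or genotypic value
-------------|-----------|-------|-------|-------|-------|--------------------------------|--------------------------------|-----------------------------------
AABB | $\frac{1}{4}(1-r)^2$ | 1 | 1 | 0 | 0 | $1 - 2r_Lr_R/(1-r) = f_1$ | $2r_L(1-r_L)r_R(1-r_R)/(1-r)^2 = g_1$ | $\mu + f_1 a + g_1 d$
AABb | $\frac{1}{2}r(1-r)$ | 1 | 0 | 0 | 1 | $(1-2r_L)r_R(1-r_R)/(r-r^2) = f_2$ | $r_L(1-r_L)(1-2r_R+2r_R^2)/(r-r^2) = g_2$ | $\mu + f_2 a + g_2 d$
AAbb | $\frac{1}{4}r^2$ | 1 | −1 | 0 | 0 | $(r_R-r_L)/r = f_3$ | $2r_L(1-r_L)r_R(1-r_R)/r^2 = g_3$ | $\mu + f_3 a + g_3 d$
AaBB | $\frac{1}{2}r(1-r)$ | 0 | 1 | 1 | 0 | $r_L(1-r_L)(1-2r_R)/(r-r^2) = f_4$ | $(1-2r_L+2r_L^2)r_R(1-r_R)/(r-r^2) = g_4$ | $\mu + f_4 a + g_4 d$
AaBb | $\frac{1}{2}(1-2r+2r^2)$ | 0 | 0 | 1 | 1 | 0 | $(1-2r_L+2r_L^2)(1-2r_R+2r_R^2)/(1-2r+2r^2) = g_5$ | $\mu + g_5 d$
Aabb | $\frac{1}{2}r(1-r)$ | 0 | −1 | 1 | 0 | $-r_L(1-r_L)(1-2r_R)/(r-r^2) = -f_4$ | $(1-2r_L+2r_L^2)r_R(1-r_R)/(r-r^2) = g_4$ | $\mu - f_4 a + g_4 d$
aaBB | $\frac{1}{4}r^2$ | −1 | 1 | 0 | 0 | $-(r_R-r_L)/r = -f_3$ | $2r_L(1-r_L)r_R(1-r_R)/r^2 = g_3$ | $\mu - f_3 a + g_3 d$
aaBb | $\frac{1}{2}r(1-r)$ | −1 | 0 | 0 | 1 | $-(1-2r_L)r_R(1-r_R)/(r-r^2) = -f_2$ | $r_L(1-r_L)(1-2r_R+2r_R^2)/(r-r^2) = g_2$ | $\mu - f_2 a + g_2 d$
aabb | $\frac{1}{4}(1-r)^2$ | −1 | −1 | 0 | 0 | $-[1 - 2r_Lr_R/(1-r)] = -f_1$ | $2r_L(1-r_L)r_R(1-r_R)/(1-r)^2 = g_1$ | $\mu - f_1 a + g_1 d$

Note: One QTL is located between two co-dominant markers A and B. r, $r_L$, and $r_R$ are the recombination frequencies between the two markers, between the QTL and the left flanking marker, and between the QTL and the right flanking marker, respectively. Assume that there is no crossover interference, i.e., $r = r_L + r_R - 2r_Lr_R$. It can be easily seen that $f_2 = \frac{1}{2}(f_1 + f_3)$ and $f_4 = \frac{1}{2}(f_1 - f_3)$ from the results in the table. 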
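
The entries of table 5.4 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it enumerates the eight gametes of an F1 plant of phase AQB/aqb under the no-interference assumption, pairs them at random to form the F2 population, and accumulates E(w) and E(v) within each marker class. The function names and the values of r_l and r_r are illustrative assumptions.

```python
from itertools import product


def gamete_probs(r_left, r_right):
    """Gamete frequencies from an F1 plant of phase AQB/aqb, no interference.

    A gamete is a tuple (a, q, b) of 0/1 flags; 1 marks the allele from P1.
    """
    probs = {}
    for a, q, b in product((1, 0), repeat=3):
        p_left = r_left if a != q else 1.0 - r_left      # crossover between A and Q?
        p_right = r_right if q != b else 1.0 - r_right   # crossover between Q and B?
        probs[(a, q, b)] = 0.5 * p_left * p_right
    return probs


def qtl_expectations(r_left, r_right):
    """Return {(n_A, n_B): (frequency, E(w), E(v))} for the nine marker classes."""
    gametes = gamete_probs(r_left, r_right)
    freq, sum_w, sum_v = {}, {}, {}
    for (g1, p1), (g2, p2) in product(gametes.items(), repeat=2):
        key = (g1[0] + g2[0], g1[2] + g2[2])   # counts of A and B alleles
        n_q = g1[1] + g2[1]                    # count of Q alleles
        w = n_q - 1                            # 1 for QQ, 0 for Qq, -1 for qq
        v = 1 if n_q == 1 else 0               # 1 only for the heterozygote Qq
        p = p1 * p2
        freq[key] = freq.get(key, 0.0) + p
        sum_w[key] = sum_w.get(key, 0.0) + p * w
        sum_v[key] = sum_v.get(key, 0.0) + p * v
    return {k: (freq[k], sum_w[k] / freq[k], sum_v[k] / freq[k]) for k in freq}


r_l, r_r = 0.1, 0.2                      # illustrative values only
r = r_l + r_r - 2 * r_l * r_r            # no crossover interference
table = qtl_expectations(r_l, r_r)
# Keys (2, 2), (2, 1), (2, 0), (1, 2) are the classes AABB, AABb, AAbb, AaBB.
f1, f2, f3, f4 = (table[k][1] for k in [(2, 2), (2, 1), (2, 0), (1, 2)])
g5 = table[(1, 1)][2]                    # E(v) in the double heterozygote AaBb
```

With these values, f2 and f4 computed by enumeration agree with (f1 + f3)/2 and (f1 − f3)/2, which is the relationship stated in the note to the table.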


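By solving equation 5.20, the following relationship is obtained, i.e., 


$$
\begin{pmatrix}
\mu + \Delta \\ A_L \\ A_R \\ D_L \\ D_R \\ AA \\ AD \\ DA \\ DD
\end{pmatrix}
=
\begin{pmatrix}
\mu + \frac{1}{2}(g_1 + g_3)d \\
f_2 a \\
\frac{1}{2}(f_1 - f_3)a \\
\left(-\frac{1}{2}g_1 - \frac{1}{2}g_3 + g_4\right)d \\
\left(-\frac{1}{2}g_1 + g_2 - \frac{1}{2}g_3\right)d \\
\frac{1}{2}(g_1 - g_3)d \\
0 \\
0 \\
\left(\frac{1}{2}g_1 - g_2 + \frac{1}{2}g_3 - g_4 + g_5\right)d
\end{pmatrix}
\tag{5.21}
$$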

It can be seen from equation 5.21 that the additive effect a of the QTL only causes 
additive effects on the flanking markers, namely $A_L$ and $A_R$. The dominant effect d of 
the QTL not only causes dominant effects on the flanking markers, namely $D_L$ and $D_R$, 
but also the additive by additive interaction AA and the dominant by dominant 
interaction DD between the two flanking markers. Therefore, if the two-way analysis 
of variance shows significant interactions between markers, it does not necessarily 
indicate that there are epistatic interactions between QTLs that control the 
phenotypic trait of interest. The interaction between markers may be caused by the 
dominant effect of a QTL, which calls for caution, especially in epistasis studies on 
quantitative traits (Zhang et al., 2008; Hua et al., 2003). 
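
The statement that dominance alone produces marker interactions can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the ICIM software: it codes the closed forms of f and g from table 5.4, builds the marker effects of equation 5.21, and checks that they reproduce all nine class means on the left-hand side of equation 5.20. The function names and the parameter values (mu, a, d, r_l, r_r) are hypothetical.

```python
def f_and_g(r_l, r_r):
    """Closed forms of f1-f4 and g1-g5 as listed in table 5.4 (no interference)."""
    r = r_l + r_r - 2 * r_l * r_r
    u_l = 1 - 2 * r_l + 2 * r_l ** 2
    u_r = 1 - 2 * r_r + 2 * r_r ** 2
    f = {1: 1 - 2 * r_l * r_r / (1 - r),
         2: (1 - 2 * r_l) * r_r * (1 - r_r) / (r - r * r),
         3: (r_r - r_l) / r,
         4: r_l * (1 - r_l) * (1 - 2 * r_r) / (r - r * r)}
    g = {1: 2 * r_l * (1 - r_l) * r_r * (1 - r_r) / (1 - r) ** 2,
         2: r_l * (1 - r_l) * u_r / (r - r * r),
         3: 2 * r_l * (1 - r_l) * r_r * (1 - r_r) / r ** 2,
         4: u_l * r_r * (1 - r_r) / (r - r * r),
         5: u_l * u_r / (1 - 2 * r + 2 * r * r)}
    return f, g


def marker_effects(mu, a, d, r_l, r_r):
    """Marker effects implied by one QTL, following equation 5.21."""
    f, g = f_and_g(r_l, r_r)
    return {"mean": mu + 0.5 * (g[1] + g[3]) * d,
            "AL": f[2] * a,
            "AR": 0.5 * (f[1] - f[3]) * a,
            "DL": (-0.5 * g[1] - 0.5 * g[3] + g[4]) * d,
            "DR": (-0.5 * g[1] + g[2] - 0.5 * g[3]) * d,
            "AA": 0.5 * (g[1] - g[3]) * d,
            "AD": 0.0, "DA": 0.0,
            "DD": (0.5 * g[1] - g[2] + 0.5 * g[3] - g[4] + g[5]) * d}


def class_mean(e, x_l, x_r, z_l, z_r):
    """Right-hand side of equation 5.20 for one marker class."""
    return (e["mean"] + e["AL"] * x_l + e["AR"] * x_r + e["DL"] * z_l + e["DR"] * z_r
            + e["AA"] * x_l * x_r + e["AD"] * x_l * z_r + e["DA"] * z_l * x_r
            + e["DD"] * z_l * z_r)


mu, a, d, r_l, r_r = 20.0, 1.0, 0.5, 0.1, 0.2   # illustrative values only
f, g = f_and_g(r_l, r_r)
effects = marker_effects(mu, a, d, r_l, r_r)
# (x_L, x_R, z_L, z_R, sign of the f term, f index, g index) per marker class
classes = [(1, 1, 0, 0, 1, 1, 1), (1, 0, 0, 1, 1, 2, 2), (1, -1, 0, 0, 1, 3, 3),
           (0, 1, 1, 0, 1, 4, 4), (0, 0, 1, 1, 0, 1, 5), (0, -1, 1, 0, -1, 4, 4),
           (-1, 1, 0, 0, -1, 3, 3), (-1, 0, 0, 1, -1, 2, 2), (-1, -1, 0, 0, -1, 1, 1)]
max_gap = max(abs(class_mean(effects, xl, xr, zl, zr) - (mu + s * f[i] * a + g[j] * d))
              for xl, xr, zl, zr, s, i, j in classes)
```

Here max_gap is zero up to rounding error, and the entries AA and DD of effects are non-zero although the model contains a single QTL and no epistasis. Calling marker_effects with d = 0 makes both interaction terms vanish.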

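Without considering the two terms AD and DA, which have the value of 0 in 
equation 5.21, the equation is re-written as, 


$$
\begin{pmatrix}
\mu + \Delta \\ A_L \\ A_R \\ D_L \\ D_R \\ AA \\ DD
\end{pmatrix}
=
\begin{pmatrix}
\mu + \frac{1}{2}(g_1 + g_3)d \\
\frac{1}{2}(f_1 + f_3)a \\
\frac{1}{2}(f_1 - f_3)a \\
\left(-\frac{1}{2}g_1 - \frac{1}{2}g_3 + g_4\right)d \\
\left(-\frac{1}{2}g_1 + g_2 - \frac{1}{2}g_3\right)d \\
\frac{1}{2}(g_1 - g_3)d \\
\left(\frac{1}{2}g_1 - g_2 + \frac{1}{2}g_3 - g_4 + g_5\right)d
\end{pmatrix}
\tag{5.22}
$$

Expectations of the indicator variables of QTL genotypes, i.e., w and v, under 
each marker class can be expressed as, 


$$
\begin{aligned}
E(w \mid x_L, x_R, z_L, z_R) &= \lambda_L x_L + \lambda_R x_R, \\
E(v \mid x_L, x_R, z_L, z_R) &= \delta + \rho_L z_L + \rho_R z_R + \lambda_{LR} x_L x_R + \rho_{LR} z_L z_R
\end{aligned}
\tag{5.23}
$$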


Thus, the relationship between the genetic value of each marker class and marker 
effects is obtained in equation 5.24, where variables $x_L$, $x_R$, $z_L$, and $z_R$, together 
with the two product terms $x_L x_R$ and $z_L z_R$, are known. By solving equation 5.24, we 
can acquire the estimates of unknown parameters including position and effects of 
the QTL, and achieve the goal of QTL mapping in F2 populations. 


$$E(G \mid x_L, x_R, z_L, z_R) = \mu + \Delta + A_L x_L + A_R x_R + D_L z_L + D_R z_R + AA\, x_L x_R + DD\, z_L z_R \tag{5.24}$$


Equation 5.24 is a completely fitted model, and coefficients in the model contain 
all the information regarding one position and two genetic effects of the QTL. In 
other words, the additive and dominant effects of the flanked QTL are completely 
absorbed by the six variables in equation 5.24. The two non-zero interactions, 
caused by the dominance effect, indicate that marker variables by themselves cannot 
completely absorb the effects of QTL located between the two markers. 


5.3.2 Additive and Dominant Model for Multiple QTLs 


Assume that m QTLs are segregating in the population derived from two homozygous 
parents P1 and P2, being distributed in m intervals flanked by m + 1 markers. Ignore 
the case where there are multiple QTLs in one marker interval. The parental QTL 
genotype is assumed to be $Q_1Q_1Q_2Q_2 \ldots Q_mQ_m$ for P1, and $q_1q_1q_2q_2 \ldots q_mq_m$ for P2. In 
the bi-parental F2 population, $X = (x_1, x_2, \ldots, x_m, x_{m+1})$ and $Z = (z_1, z_2, \ldots, z_m, z_{m+1})$ 
represent the marker indicator variables, taking values 1 and 0 for the homozygous 
marker genotypes of parent P1, 0 and 1 for the heterozygous marker genotypes, and 
−1 and 0 for the homozygous marker genotypes of parent P2. $W = (w_1, w_2, \ldots, w_m)$ 
and $V = (v_1, v_2, \ldots, v_m)$ represent the indicator variables of unknown QTL 
genotypes. Values of w and v are similar to those of marker variables x and z, 
respectively. Additive effects of QTLs are represented by $a_1, a_2, \ldots, a_m$, and dominant 
effects are represented by $d_1, d_2, \ldots, d_m$. Assuming that the effects from individual 
QTLs are additive, the genotypic value G of one F2 individual is expressed as, 


$$G = \mu + \sum_{j=1}^{m} \left[a_j w_j + d_j v_j\right] \tag{5.25}$$


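When marker genotypes are known, the expected genotypic value G can be 
expressed as a linear function of marker variables, i.e., 


$$
E(G \mid X, Z) = \mu + \sum_{j=1}^{m} \left[\Delta_j + A_{j,j} x_j + A_{j,j+1} x_{j+1} + D_{j,j} z_j + D_{j,j+1} z_{j+1} + AA_j x_j x_{j+1} + DD_j z_j z_{j+1}\right]
\tag{5.26}
$$

By rearrangement of equation 5.26, the linear relationship between the expected 
genotypic value and marker variables is obtained, i.e., 


$$
E(G \mid X, Z) = \beta_0 + \sum_{j=1}^{m+1} \beta_j x_j + \sum_{j=1}^{m+1} \gamma_j z_j + \sum_{j=1}^{m} \beta_{j,j+1} x_j x_{j+1} + \sum_{j=1}^{m} \gamma_{j,j+1} z_j z_{j+1}
\tag{5.27}
$$

where 

$$
\begin{aligned}
&\beta_0 = \mu + \sum_{j=1}^{m} \Delta_j, \quad \beta_1 = A_{1,1}, \quad \gamma_1 = D_{1,1}, \\
&\beta_j = A_{j-1,j} + A_{j,j}, \quad \gamma_j = D_{j-1,j} + D_{j,j}, \quad j = 2, \ldots, m, \\
&\beta_{m+1} = A_{m,m+1}, \quad \gamma_{m+1} = D_{m,m+1}, \\
&\beta_{j,j+1} = AA_j, \quad \gamma_{j,j+1} = DD_j, \quad j = 1, \ldots, m
\end{aligned}
$$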


Based on the genetic model in equation 5.25 and the linear model in 
equation 5.27, the following linear regression model between the observed phenotypic 
values and known marker types can be obtained and then used in QTL mapping in 
F2 populations, 

$$
\begin{aligned}
y &= E(G \mid X, Z) + \varepsilon \\
  &= \beta_0 + \sum_{j=1}^{m+1} \beta_j x_j + \sum_{j=1}^{m+1} \gamma_j z_j + \sum_{j=1}^{m} \beta_{j,j+1} x_j x_{j+1} + \sum_{j=1}^{m} \gamma_{j,j+1} z_j z_{j+1} + \varepsilon
\end{aligned}
\tag{5.28}
$$

where y is the observed value of a quantitative trait of interest; $x_j$ and $z_j$ are known 
marker indicators; $\beta_j$ and $\gamma_j$ are to be estimated, representing the additive and 
dominant effects of the jth marker, respectively; $\beta_{j,j+1}$ and $\gamma_{j,j+1}$ are also to be 
estimated, representing the additive by additive and dominant by dominant 
interactions between the jth and (j + 1)th markers, respectively; and ε is the 
normally distributed random error. 
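
As a small illustration of how the regression variables in equation 5.28 can be laid out, the sketch below converts the marker genotype codes of one F2 individual into the intercept, the x and z indicators, and the products of adjacent markers. It assumes the number coding 2, 1, and 0 for the P1 homozygote, the heterozygote, and the P2 homozygote (the code 1 for the heterozygote is an assumption here), and it ignores missing genotypes. The function names are hypothetical.

```python
def indicators(code):
    """Map a genotype code to (x, z): 2 = P1 homozygote, 1 = heterozygote, 0 = P2 homozygote."""
    return {2: (1, 0), 1: (0, 1), 0: (-1, 0)}[code]


def design_row(codes):
    """Regression variables of equation 5.28 for one individual.

    Order: intercept, x_1..x_{m+1}, z_1..z_{m+1},
    x_j * x_{j+1} for adjacent markers, then z_j * z_{j+1} for adjacent markers.
    """
    xs, zs = zip(*(indicators(c) for c in codes))
    add_by_add = [xs[j] * xs[j + 1] for j in range(len(codes) - 1)]
    dom_by_dom = [zs[j] * zs[j + 1] for j in range(len(codes) - 1)]
    return [1, *xs, *zs, *add_by_add, *dom_by_dom]


# Four markers on one chromosome: AA, AA, Aa, Aa (illustrative genotype only)
row = design_row([2, 2, 1, 1])
```

For m + 1 markers on a chromosome the row has 1 + 2(m + 1) + 2m entries. Only products of adjacent markers are included, in line with the assumption that the effects of each QTL are absorbed by its two flanking markers.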


5.3.3 One-Dimensional Scanning and Hypothesis Testing 
in Additive and Dominant QTL Mapping 


Similar to ICIM in DH populations introduced in the previous section, the first 
step of ICIM in F2 populations is to use stepwise regression to estimate 
the marker effects in the linear model given by equation 5.28. The second step is to 
use equation 5.28 for background control, and search for the additive and dominant 
QTLs by one-dimensional scanning along the whole genome. Assuming that the 
current scanning position is within the marker interval (k, k + 1), the observed 
phenotypic value is adjusted as follows, 


$$
\Delta y_i = y_i - \sum_{j \neq k, k+1} \left[\hat{\beta}_j x_{ij} + \hat{\gamma}_j z_{ij}\right] - \sum_{j \neq k} \left[\hat{\beta}_{j,j+1} x_{ij} x_{i,j+1} + \hat{\gamma}_{j,j+1} z_{ij} z_{i,j+1}\right]
\tag{5.29}
$$
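
The adjustment in equation 5.29 can be sketched as follows. This is a minimal illustration that assumes the first sum skips the two markers flanking the scanned interval and the second sum skips only the interaction between those two markers; markers not selected by stepwise regression simply carry zero coefficients. The function name, the 0-based index k, and the numbers are hypothetical.

```python
def adjusted_phenotype(y, x, z, beta, gamma, beta_int, gamma_int, k):
    """Adjusted phenotype of one individual when scanning interval (k, k + 1).

    x, z           : marker indicators, length m + 1
    beta, gamma    : estimated additive and dominant marker effects, length m + 1
    beta_int/gamma_int : estimated interactions between markers j and j + 1, length m
    k              : 0-based index of the left flanking marker
    """
    dy = y
    for j in range(len(x)):
        if j not in (k, k + 1):                 # keep the two flanking markers
            dy -= beta[j] * x[j] + gamma[j] * z[j]
    for j in range(len(x) - 1):
        if j != k:                              # keep their mutual interaction
            dy -= beta_int[j] * x[j] * x[j + 1] + gamma_int[j] * z[j] * z[j + 1]
    return dy


# One individual with four markers; the scan is inside the interval of markers 1 and 2.
dy = adjusted_phenotype(
    y=10.0,
    x=[1, 1, 0, -1], z=[0, 0, 1, 0],
    beta=[0.5, 0.2, 0.0, 0.3], gamma=[0.0, 0.0, 0.1, 0.0],
    beta_int=[0.4, 0.0, 0.0], gamma_int=[0.0, 0.0, 0.0],
    k=1,
)
```

In this example the effects attached to markers 1 and 2 stay in the phenotype, while the contributions of markers 0 and 3 and of the interaction between markers 0 and 1 are removed.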


If there is one QTL in the marker interval, distributions of the three QTL genotypes 
QQ, Qq, and qq are N(μ1, σ²), N(μ2, σ²), and N(μ3, σ²), respectively. Whether there 
is a QTL at the current scanning position can be determined by testing the following 
two hypotheses, 


$$
\begin{aligned}
&H_0: \mu_1 = \mu_2 = \mu_3, \\
&H_A: \text{at least two of } \mu_1, \mu_2, \mu_3 \text{ are not equal to each other}
\end{aligned}
\tag{5.30}
$$


The log-likelihood function under the alternative hypothesis $H_A$ is, 


$$
\ln L_A = \sum_{k=1}^{9} \sum_{i \in S_k} \ln\left[\pi_{k1} f(\Delta y_i; \mu_1, \sigma^2) + \pi_{k2} f(\Delta y_i; \mu_2, \sigma^2) + \pi_{k3} f(\Delta y_i; \mu_3, \sigma^2)\right]
\tag{5.31}
$$

where $S_k$ represents the set of individuals in the kth marker class (k = 1, 2, ..., 9), 
$\pi_{kl}$ is the proportion of the lth QTL genotype in the kth marker class, l = 1, 2, and 3 
represent the three QTL genotypes, and f(•; μ1, σ²), f(•; μ2, σ²), and f(•; μ3, σ²) are 
the probability density functions of normal distributions N(μ1, σ²), N(μ2, σ²), and 
N(μ3, σ²), followed by the QTL genotypes QQ, Qq, and qq, respectively. 

The EM algorithm is used to calculate the maximum likelihood estimates of 
parameters in equation 5.31. The estimating procedure is similar to IM as introduced 
in chapter 4, which will not be repeated here. Maximum likelihood estimates are 
represented by $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\mu}_3$, and $\hat{\sigma}^2$, respectively. From the genetic model given in 
equation 5.18, the relationship between means of QTL genotypes and genetic effects 
of the QTL is, 


$$\mu_1 = \mu + a, \quad \mu_2 = \mu + d, \quad \mu_3 = \mu - a \tag{5.32}$$

Thus, 


$$\mu = \frac{1}{2}(\mu_1 + \mu_3), \quad a = \frac{1}{2}(\mu_1 - \mu_3), \quad d = \mu_2 - \frac{1}{2}(\mu_1 + \mu_3) \tag{5.33}$$


Substituting μ1, μ2, and μ3 in equation 5.33 with the maximum likelihood 
estimates $\hat{\mu}_1$, $\hat{\mu}_2$, and $\hat{\mu}_3$, respectively, gives the estimates of additive effect a and 
dominant effect d of the QTL. 

Under the null hypothesis $H_0$, the adjusted phenotypic values follow the same 
distribution, i.e., $N(\mu_0, \sigma_0^2)$, and the log-likelihood function is, 


$$\ln L_0 = \sum_{i=1}^{n} \ln\left[f(\Delta y_i; \mu_0, \sigma_0^2)\right] \tag{5.34}$$


Maximum likelihood estimates under the two hypotheses are substituted into 
equations 5.31 and 5.34 to have the two maximum likelihoods, by which the LRT 
statistic and LOD score are calculated, and the significance test for the null 
hypothesis is performed accordingly. 
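
The test at one scanning position can be sketched in a few lines. The code below is a simplified illustration, not the algorithm as implemented in any particular software: it assumes that the adjusted phenotypes and the conditional QTL genotype probabilities of each individual (the proportions of its marker class) are already available, runs a plain EM iteration for the three means and the common variance in equation 5.31, and converts the two maximized log-likelihoods into a LOD score. The function names, starting values, and toy data are assumptions made for the example.

```python
import math


def normal_pdf(y, mu, var):
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_lik_alt(dy, prior, mu, var):
    """ln L_A of equation 5.31, with prior[i] = (pi_QQ, pi_Qq, pi_qq) for individual i."""
    return sum(math.log(sum(p[l] * normal_pdf(y, mu[l], var) for l in range(3)))
               for y, p in zip(dy, prior))


def scan_position(dy, prior, max_iter=500, tol=1e-10):
    n = len(dy)
    mu0 = sum(dy) / n                                   # MLE under H0
    var0 = sum((y - mu0) ** 2 for y in dy) / n
    ln_l0 = sum(math.log(normal_pdf(y, mu0, var0)) for y in dy)

    sd0 = math.sqrt(var0)
    mu, var = [mu0 + sd0, mu0, mu0 - sd0], var0          # arbitrary starting values
    ln_la = log_lik_alt(dy, prior, mu, var)
    for _ in range(max_iter):
        # E-step: posterior probability of each QTL genotype for each individual
        post = []
        for y, p in zip(dy, prior):
            joint = [p[l] * normal_pdf(y, mu[l], var) for l in range(3)]
            total = sum(joint)
            post.append([j / total for j in joint])
        # M-step: weighted means and common variance
        for l in range(3):
            weight = sum(q[l] for q in post)
            if weight > 0:
                mu[l] = sum(q[l] * y for q, y in zip(post, dy)) / weight
        var = sum(q[l] * (y - mu[l]) ** 2 for q, y in zip(post, dy) for l in range(3)) / n
        new_ln_la = log_lik_alt(dy, prior, mu, var)
        converged = abs(new_ln_la - ln_la) < tol
        ln_la = new_ln_la
        if converged:
            break
    additive = 0.5 * (mu[0] - mu[2])                    # equation 5.33
    dominant = mu[1] - 0.5 * (mu[0] + mu[2])
    lod = (ln_la - ln_l0) / math.log(10.0)
    return additive, dominant, lod


# Toy data: the scanning position coincides with a fully informative marker,
# so every individual has a known QTL genotype (one-hot probabilities).
dy = [3.1, 2.9, 3.0, 2.2, 2.0, 2.1, 1.1, 0.9, 1.0]
prior = [(1, 0, 0)] * 3 + [(0, 1, 0)] * 3 + [(0, 0, 1)] * 3
a_hat, d_hat, lod = scan_position(dy, prior)
```

With one-hot probabilities the EM estimates reduce to the three group means, so a_hat and d_hat follow directly from equation 5.33. Between markers the probabilities are fractional and the iteration does real work.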


5.3.4 Application of ICIM in an F2 Mapping Population 


Figure 1.8 in chapter 1 shows the genotypic data of 12 markers on the first 
chromosome in an F2 population, and table 4.6 in chapter 4 shows the phenotypic data of 
110 individuals in this population. Parent P1 is coded by the single letter A 
(equivalent to the number code of 2), and its average phenotypic value is 21.0. 




FIG. 5.5 – Phenotypic frequency distribution of an F2 population composed of 110 
individuals. Annotations in the figure: parent P2 is coded as 0, with a phenotypic 
mean of 19.0; parent P1 is coded as 2, with a phenotypic mean of 21.0. 


Parent P2 is coded by the single letter B (equivalent to the number code of 0), and its 
average phenotypic value is 19.0. Phenotypic values of the F2 individuals show a 
typical continuous distribution, with the lowest, highest, and average values being 
16.14, 23.23, and 19.66, respectively. Transgressive segregation is observed in both 
directions (figure 5.5). 

Figure 5.6 shows the profiles of LOD score, PVE, additive effect, and dominant 
effect from two mapping methods, i.e., IM and ICIM, with a scanning step of 1 cM. 
From the phenotypic distribution in figure 5.5, it is impossible to learn the number 
of genes that control the phenotypic trait, let alone to estimate the genetic 
effects of these genes. However, it can be clearly seen from the LOD score profile of 
ICIM that there are three obvious peaks on chromosomes 1–3, indicating that there 
is one QTL controlling the phenotypic trait on each of the three chromosomes 
(figure 5.6A). From the profile of additive effect, it can be seen that the additive 
effects at the two peaks on the first and third chromosomes are close to 1, whereas 
the additive effect at the peak on the second chromosome is close to −1. From the 
profile of the dominant effect, it can be seen that the corresponding dominant effects 
at the three peaks are not large. Comparing the LOD score profiles from the two 
methods, ICIM with background control clearly yields more distinct and higher 
peaks; thus, QTLs can be mapped to narrower intervals on chromosomes. 

When the LOD score threshold is set as 2.5, the ICIM method identifies three 
QTLs, one on each of chromosomes 1–3 (figure 5.6). Table 5.5 shows the estimated 
positions and genetic effects at the three significant peaks. According to the naming 
criterion introduced in §5.2.4, using ‘Trt’ to represent the abbreviation of the trait 
of interest, the three QTLs are named qTrt1, qTrt2, and qTrt3. qTrt1 and qTrt3 
have positive additive effects, while qTrt2 has a negative additive effect, indicating 
that both parents have alleles that could increase or decrease the phenotypic values. 
Compared with additive effects, the dominant effects of the three QTLs are 
relatively small. Mapping results given in table 5.5 provide a reasonable explanation for 
the transgressive segregation observed in the phenotypic distribution in figure 5.5. 

FIG. 5.6 – Profiles of LOD score (A), PVE (B), additive effect (C), and dominant effect 
(D) from one-dimensional scanning on four chromosomes for a phenotypic trait in an F2 
population of 110 individuals. Notes: The thin line is the profile from IM, and the thick line is 
the profile from ICIM (PIN = 0.001, R² = 0.60). The scanning step is 1 cM. 


TAB. 5.5 – Mapping results of additive and dominant QTLs on one phenotypic trait in an F2 
population of 110 individuals by ICIM (PIN = 0.001, POUT = 0.002). 

Chromosome  Position (cM)  Nearest left marker  Nearest right marker  LOD score  PVE (%)  Additive effect  Dominant effect
1           28             M1-8                 M1-9                  10.67      21.19     0.9962           0.0529
2           55             M2-12                M2-13                  5.79      10.60    −0.9573          −0.1819
3           26             M3-4                 M3-5                  13.26      28.33     1.1031           0.0660


5.4 Type II Error in Hypothesis Testing and Statistical 
Power in QTL Detection 


5.4.1 Type II Error and Statistical Power in Hypothesis 
Testing 


In §4.3, chapter 4, we introduced in detail how the appropriate threshold of LOD 
score can be determined in QTL mapping by evaluating the distribution of test 
statistics under the null hypothesis of non-QTL, aiming to control the probability of 
the genome-wide type I error below a given level. The use of more stringent 
threshold values can further reduce the probability of making type I error. But on 
the other hand, higher threshold values may lead to more false inferences for those 
QTLs having relatively smaller effects. This type of error, i.e., accepting the 
non-QTL hypothesis when true QTLs are actually present, is called type II error. 
In any scientific research, for obvious reasons, it is expected that the 
probabilities of both types of error are controlled at low levels. For a true QTL, if the 
null hypothesis is rejected, the QTL is detected successfully. The probability of 
successfully detecting the presence of a true QTL is called the detection power, 
which is equal to 1 minus the probability of type II error. Let α and β represent the 
probabilities of type I and II errors, respectively. Therefore, 1 − β is equal to the 
detection power. 

Probabilities of two types of error and statistical power are considered to be the 
key elements in hypothesis testing. As examples, binomial distribution (representing 
one of the discrete distributions) and normal distribution (representing one of the 
continuous distributions) are used here to further illustrate their statistical meaning 
and calculating method. Assume that the number of trials is n = 6 from a binomial 
distribution, and the number of successes is represented by random variable X 
taking possible values x = 0, 1, ..., 6. Assuming that the probability of success in each 
trial is p, i.e., the event of success in one trial follows the Bernoulli 
distribution (also known as the two-point distribution), the number of successes 
(X) from n independent trials follows the binomial distribution, which is denoted as 
X ~ B(n, p). When n = 6, the probability function of random variable X is, 


$$\Pr(X = x) = C_6^x p^x (1 - p)^{6 - x} \tag{5.35}$$


where $C_6^x = \frac{6!}{x!(6 - x)!}$ is the number of combinations. Now, assume a statistical test is to 
be performed against the null hypothesis H0: p = 0.5, based on the observed number 
of successes from the six trials. The second column in table 5.6 shows the 
distribution probabilities of X = x (x = 0, 1, ..., 6) under the null hypothesis. When 
the null hypothesis is true, the maximum probability of 0.3125 is achieved at x = 3. 
If the n independent trials are repeated many times, X could be equal to 3 exactly in 
some repeats, but greater or less than 3 in other repeats. If the number of repeats is 
large enough, there may be cases where the number of successes is equal to 0 or 6, 
but the probability of success in one trial is still equal to 0.5. 

In binomial distribution B(n = 6, p = 0.5), the probability that X is equal to 0 or 
6 is rather low, i.e., 0.03125 (sum of the two probabilities at X = 0 and 6 in 
table 5.6). Therefore, set {0, 6} can be defined as the rejection region, based on the 
principle of the small-probability event in statistics; that is, if X in one experiment of 
n = 6 Bernoulli trials is equal to 0 or 6, H0: p = 0.5 is rejected, and the success 
probability p of the binomial distribution is considered to be different from 0.5. 
Corresponding to the rejection region is the acceptance region, represented by set 
{1, 2, 3, 4, 5}; that is, if X in one experiment falls in the region from 1 to 5, H0: 
p = 0.5 is accepted, or the success probability p in the binomial distribution is not 
significantly different from 0.5. Type I error occurs when the null hypothesis is true 
but rejected. Its probability is equal to the probability that the test statistic falls 
within the rejection region under the null hypothesis. Therefore, by using the rejection 
and acceptance regions to make the statistical inference, it can be guaranteed that the 
probability of making the type I error is equal to α = 0.03125. 

TAB. 5.6 – Probabilities of the number of successes from three binomial distributions, i.e., 
B(n, p = 0.5), B(n, p = 0.75), and B(n, p = 0.9) where n = 6. 

X                              H0: p = 0.5      HA: p = 0.75     HA: p = 0.9
0                              0.015625         0.000244         0.000001
1                              0.093750         0.004395         0.000054
2                              0.234375         0.032959         0.001215
3                              0.312500         0.131836         0.014580
4                              0.234375         0.296631         0.098415
5                              0.093750         0.355957         0.354294
6                              0.015625         0.177979         0.531441
Probability of type I error    0.031250         Not applicable   Not applicable
Probability of type II error   Not applicable   0.821777         0.468558
Statistical power              Not applicable   0.178223         0.531442

Type II error is based on the fact that the alternative hypothesis is true, and its 
probability is equal to the probability that the test statistic falls within the 
acceptance region when the alternative hypothesis is true. Therefore, to calculate the 
probability of type II error, we must first know exactly what the alternative 
hypothesis is. For example, for alternative hypothesis HA: p = 0.75, the probability 
that X = 1–5 is equal to 0.821777. When X falls in this region, the null hypothesis 
H0: p = 0.5 is false but is still accepted. The probability of making such an error is 
equal to β = 0.821777. The probability that X = 0 or 6 is equal to 1 − β = 0.178223. 
When X is 0 or 6, the null hypothesis H0: p = 0.5 is rejected, and the alternative 
hypothesis HA: p = 0.75 is accepted instead. Therefore, in the binomial distribution 
where the alternative hypothesis HA: p = 0.75 is true, the probability of making the 
correct inference is equal to 0.178223, which is referred to as the statistical power in 
testing the alternative hypothesis. In another example where the alternative 
hypothesis is HA: p = 0.9, the probability that X = 1–5 is the sum of the 
corresponding entries in the last column of table 5.6, so the probability of making the 
type II error is equal to β = 0.468558, and the detection power, i.e., the probability 
that X = 0 or 6, is equal to 1 − β = 0.531442. It is easy to see that the smaller the 
difference between the two hypotheses, or the closer the alternative hypothesis is to 
the null hypothesis, the greater the probability of making the type II error (that is, 
β is closer to 1 − α), and the lower the detection power (that is, 1 − β is closer to the 
probability of type I error, i.e., α). On the contrary, the greater the difference 
between the two hypotheses, or the farther the alternative hypothesis is from the null 
hypothesis, the lower the probability of type II error and the higher the detection 
power. 
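
The probabilities discussed for the binomial example follow directly from equation 5.35. The short sketch below recomputes them with the rejection region {0, 6}; the function names are illustrative, and only the Python standard library is used.

```python
from math import comb


def binom_pmf(x, n, p):
    """Probability of x successes in n trials, equation 5.35 with general n."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def rejection_probability(p, n=6, rejection=(0, 6)):
    """Probability that X falls in the rejection region when the success rate is p."""
    return sum(binom_pmf(x, n, p) for x in rejection)


alpha = rejection_probability(0.5)                    # type I error under H0: p = 0.5
power = {p: rejection_probability(p) for p in (0.75, 0.9)}
beta = {p: 1.0 - pw for p, pw in power.items()}       # type II error = 1 - power
```

Under p = 0.9 the rejection probability is 0.1^6 + 0.9^6, so the power grows as the alternative moves away from p = 0.5, in line with the closing remark of the paragraph above.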

The binomial distribution is discrete, and its distribution probability is relatively 
simple. The following gives another example of probabilities of two types of error, 
and the statistical power from the significance test on the population mean of a 
normal distribution, which is continuous. Assuming that a test statistic X follows a 
normal distribution with a mean of μ and a variance of 1 (when the variance is not 
equal to 1, the standardized transformation can be applied), the probability density 
function is given by, 


$$f(x; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x - \mu)^2\right] \tag{5.36}$$


Assume that the null hypothesis H0: μ = 0 is to be tested for its significance. 
When the null hypothesis is true, one observed value of random variable X should 
not be too far away from the population mean of zero. From the probability function 
of the standard normal distribution N(0, 1) (figure 5.7), we know that the 
probability of event {|X| > $Z_{0.025}$ = 1.96} is equal to 0.05; that is, $Z_{0.025}$ = 1.96 
corresponds to the critical value of the testing statistic when the right-tail probability is 
equal to 0.025. Define {X: |X| > 1.96} as the rejection region; that is, when 
|X| > 1.96, H0: μ = 0 is rejected, or the population mean is declared to be 
significantly different from 0. The acceptance region is, therefore, {X: |X| ≤ 1.96}; that is, 
when |X| ≤ 1.96, H0: μ = 0 is accepted, or the population mean is not significantly 
different from 0. By using the rejection and acceptance regions defined above in 
inference, it can be guaranteed that the probability of type I error in one single test 
does not exceed 0.05. 

Same as in the case of discrete distributions, the alternative hypothesis has to be 
known in order to calculate the probability of type II error. For alternative 
hypothesis HA: μ = 1, the probability that X falls in the acceptance region is equal 
to 0.83, that is, 


$$
\begin{aligned}
\beta &= \int_{-1.96}^{1.96} f(x; \mu = 1)\,dx = \int_{-\infty}^{0.96} f(x; \mu = 0)\,dx - \int_{-\infty}^{-2.96} f(x; \mu = 0)\,dx \\
      &= 0.8315 - 0.0015 = 0.83
\end{aligned}
$$


When X is located between −1.96 and 1.96, the null hypothesis H0: μ = 0 is 
accepted even though it is false. Therefore, the probability of type II error is equal to 
β = 0.83, and the testing power is equal to 1 − β = 0.17. Similar to the case of discrete 
distributions, it can be seen from figure 5.7 as well that the smaller the difference 
between the two hypotheses, the larger the shaded area of the figure on the right, 
the greater the probability of making the type II error, and the lower the detection 
power. On the contrary, the greater the difference between the two hypotheses, the 
lower the probability of making the type II error, and the higher the detection power. 

FIG. 5.7 – Calculation of the probabilities of type I and type II errors in testing the null 
hypothesis H0: N(0, 1) against the alternative hypothesis HA: N(1, 1). 
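
The type II error of the normal example can be recomputed with the standard library. The sketch below assumes a unit variance and a two-sided test, as in the text; the function name is illustrative.

```python
from statistics import NormalDist


def type2_error(mu_alt, alpha=0.05):
    """P(|X| <= Z_{alpha/2}) when X ~ N(mu_alt, 1), i.e. H0: mu = 0 is wrongly accepted."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96 for alpha = 0.05
    alt = NormalDist(mu=mu_alt, sigma=1.0)
    return alt.cdf(crit) - alt.cdf(-crit)


beta = type2_error(1.0)       # alternative hypothesis HA: mu = 1
power = 1.0 - beta
```

Evaluating type2_error at larger values of mu_alt shows the type II error shrinking as the alternative moves away from the null hypothesis, while at mu_alt = 0 the function returns 1 − α.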


5.4.2 Probability of Two Types of Error 
and the Appropriate Sample Size 


In hypothesis testing, the first thing to control is the probability to make type I 
error. The major purpose of investigating the probability of type II error is to 
evaluate the power of one statistical test so that the pros and cons of different testing 
or statistical methods can be compared. For example, at the same significance level, 
if the probability to make the type II error when using test (or method) A is lower 
than that when using test (or method) B, we can say that the power or ability to 
detect the true alternative hypothesis by method A is higher than the power by 
method B. Therefore, method A is statistically considered to be better than method 
B. One other purpose of power analysis is to determine an appropriate sample size 
before the experiment is conducted. 

The significance test of the difference between two population means from two 
normal distributions with equal and known variance, i.e., N(μ1, σ²) and N(μ2, σ²), is 
used as the example. Assume one random sample with the size of n is available from 
each population, and the two sample means are denoted by $\bar{X}_1$ and $\bar{X}_2$. Assuming 
the null hypothesis is H0: μ1 = μ2, the statistic to test the significance of the null 
hypothesis follows a standard normal distribution, given in equation 5.37. 

$$\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{2/n}\,\sigma} \sim N(0, 1) \tag{5.37}$$

Therefore, the rejection region at the significance level α is defined by 
$\left|\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{2/n}\,\sigma}\right| > Z_{\alpha/2}$, where $Z_{\alpha/2}$ corresponds to the critical value of the standard normal 
distribution when the right-tail probability is equal to α/2. Under the alternative 
hypothesis HA: μ1 ≠ μ2, let $\delta = \frac{\mu_1 - \mu_2}{\sigma}$ represent the difference between the two population 
means in units of the standard deviation σ. It is assumed that δ > 0 for convenience. 
When HA is true, the distribution of the test statistic is given by equation 5.38. 

$$\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{2/n}\,\sigma} \sim N\left(\frac{\delta}{\sqrt{2/n}}, 1\right) \tag{5.38}$$

If the probability of type II error is to be controlled under β, or equivalently, the 
statistical power of the test is to be no less than 1 − β, the right-tail probability 
corresponding to the critical value $Z_{\alpha/2}$ under the distribution given by equation 5.38 
has to be no less than 1 − β. That is to say, 

$$Z_{1-\beta} \geq Z_{\alpha/2} - \frac{\delta}{\sqrt{2/n}} \tag{5.39}$$

In equation 5.39, the right-tail probability corresponding to $Z_{1-\beta}$ in the standard 
normal distribution is equal to 1 − β; and the right-tail probability corresponding to 
$Z_{\alpha/2} - \frac{\delta}{\sqrt{2/n}}$ in the standard normal distribution is equal to the testing power. As long as 
the relationship given in equation 5.39 is satisfied, the detection power would be no 
less than 1 − β. In addition, from the property of symmetry of normal distributions, 
we have $Z_{1-\beta} = -Z_{\beta}$. Thus, the relationship of sample size (n) depending on 
probabilities of two types of error (α and β), and the actual difference between population 
means (δ) is established as follows. 

$$n \geq \frac{2\left(Z_{\alpha/2} + Z_{\beta}\right)^2}{\delta^2} \tag{5.40}$$

Table 5.7 gives the minimum sample sizes required to detect given differences 
between two population means at two significance levels and three detection 
powers. It can be seen that at the same significance level, the higher the 
testing power needed, the larger the sample size required. For example, for a 
difference of one standard deviation (that is, δ = 1) and a significance level of 0.05, 
at least 16 observations are needed from each population to have a probability of 0.8 
to detect the difference. If a probability of 0.95 is required, at least 26 observations 
are needed from each population. It is easy to see from table 5.7 that a large 
difference can be detected without using too many observations. For example, for a 
difference of twice the standard deviation, that is, δ = 2, only a few observations are 
needed. On the contrary, for small differences, such as δ = 0.2, hundreds of 
observations are required in order to have the testing power higher than 0.8. 


TAB. 5.7 — The minimum sample size required to detect a given difference between two 
population means (ô) at given significance level (a) and given detection power (1 — £). 


ö a = 0.05 a = 0.01 

1-f=08 1-2-—09 1-f=0.95 1-f£=08 1-f=09 1-f=0.95 
0.2 392 525 650 584 744 891 
0.4 98 131 162 146 186 223 
0.6 44 58 72 65 83 99 
0.8 25 33 41 36 46 56 
1 16 21 26 23 30 36 
1.2 11 15 18 16 21 25 
1.4 8 11 13 12 15 18 
1.6 6 8 10 9 12 14 
1.8 5 6 8 7 9 11 
2 4 5 6 6 it 9 

To determine the required sample size from equation 5.40 or table 5.7, it is
necessary to have a rough idea in advance of the difference to be tested. For
example, in a multi-environment crop trial, the yield of the check is about 500 kg
for a given field area, and it is desired to test a new breeding line with a
potential yield of about 520 kg, with a power of 0.90 at a significance level of 0.05.
Based on experiments from previous years, it is known that the standard deviation
of each observation is about 10 kg, i.e., the variance of random errors in the field is
100 kg². Therefore, the difference to be tested is twice the standard deviation (δ = 2),
and at least five replications are required to achieve the required testing power at
the required significance level. If there is little prior knowledge of the variances of the
tested populations and of the difference between population means, and the cost of
acquiring one observation is relatively high, a small preliminary experiment can
be carried out first, from which the means and variances of the tested populations can be
roughly estimated. If the preliminary experiment does not give a significant result,
further experiments with a larger sample size can then be considered.
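
The sample-size lookup can also be done numerically. The sketch below is not taken from the text: it assumes the usual normal-approximation formula for a two-sided test of two population means with equal variances, n = 2(z_{α/2} + z_β)²/δ², and assumes rounding to the nearest integer, a rule that happens to reproduce the entries of table 5.7. The function name `min_sample_size` is hypothetical.

```python
from statistics import NormalDist


def min_sample_size(delta, alpha=0.05, power=0.80):
    """Observations per population needed to detect a standardized difference.

    delta is the difference between the two means divided by the standard
    deviation. Assumed formula: n = 2 * (z_{alpha/2} + z_beta)**2 / delta**2.
    """
    z = NormalDist()                      # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    z_beta = z.inv_cdf(power)             # quantile for the desired power
    n = 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return round(n)                       # nearest integer (assumed rounding rule)


# Breeding-line example: (520 - 500) / 10 = 2 standard deviations.
delta = (520 - 500) / 10
print(min_sample_size(delta, alpha=0.05, power=0.90))
```

Under these assumptions, the breeding-line example gives the same five replications as read from table 5.7.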


5.4.3 Distribution and Effect Models of QTLs Used
in Power Analysis by Simulations


As seen in the previous two sections, for some simple hypothesis tests,
statistical power at a given significance level can be calculated from the distributions of
the test statistic under the null and alternative hypotheses. The situation is more
complex for the statistical tests in QTL mapping: a number of QTLs may
be present at different chromosomal positions in one mapping population; there are
many alternative hypotheses when the null hypothesis (i.e., no QTL) is false; and the
whole genome has to be tested, so a large number of tests are conducted.
Together, these factors make it practically impossible to evaluate the genome-wide
significance and estimate the detection power by mathematical deduction. In such
cases, we have to rely on computer simulation, which is also referred to as
Monte Carlo simulation.
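
The Monte Carlo idea can be illustrated on a test simple enough that the power is also known analytically. The following sketch is an illustration under stated assumptions, not the simulation used in this chapter: it assumes a two-sided two-sample z-test with known unit variance, and the function name `simulated_power` is hypothetical. Power is estimated as the fraction of simulated experiments in which the null hypothesis is rejected.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(delta, n, alpha=0.05, n_sim=2000, seed=1):
    """Monte Carlo power of a two-sided two-sample z-test with unit variance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 / n) ** 0.5                   # standard error of the mean difference
    rejections = 0
    for _ in range(n_sim):
        sample1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sample2 = [rng.gauss(delta, 1.0) for _ in range(n)]
        z = (fmean(sample2) - fmean(sample1)) / se
        if abs(z) > z_crit:               # null hypothesis rejected
            rejections += 1
    return rejections / n_sim


# delta = 1 with 16 observations per population (a table 5.7 setting).
print(f"estimated power: {simulated_power(delta=1.0, n=16):.3f}")
```

For this setting the estimate should fall close to the nominal power of 0.8. In QTL mapping the same counting principle is applied, but each simulated experiment is a whole mapping population.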

Usually, more than one QTL controls a quantitative trait. When
multiple QTLs are present, some of them may be located on the same chromosome.
When two QTLs are linked, two phases can be distinguished, i.e., linkage in
coupling, when the effects of the two linked QTLs are in the same direction, and
linkage in repulsion, when their effects are in opposite directions.
It is impossible to list exhaustively the possible scenarios of positions and
effects of QTLs for a phenotypic trait of interest. We assume that there are six
chromosomes in the genome, each 120 cM long, and that
13 markers are evenly distributed on each chromosome. Table 5.8 shows four
possible scenarios of QTL positions and effects, each having four QTLs. In the genetic
models of independent inheritance, the four QTLs are located at 35 cM on four
different chromosomes. In the genetic models of linkage, Q1 and Q2 are located at 35 cM
and 65 cM on one chromosome, and Q3 and Q4 are located at 35 cM and 65 cM on
another chromosome.

In independent model I, all positive-effect alleles at the four QTLs come from
one parent, and all negative-effect alleles come from the other parent. In
independent model II, each parent has two positive-effect alleles and two negative-effect
alleles. In the bi-parental DH population, the genetic variance under the independent models
is


$$V_G = \sum_{i=1}^{4} a_i^2 \qquad (5.41)$$


where a_i represents the additive effect of the ith QTL. For the effects given in
table 5.8, the total genetic variance in the DH population is V_G = 1. Under the
assumption that the error variance is equal to 1, the heritability of the phenotypic
trait is equal to 0.5, and the PVEs of the four QTLs are equal to 5%, 10%, 15%, and 20% in
both independent models (table 5.8).
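
These values can be checked with a few lines of arithmetic. The sketch below uses the additive effects listed in table 5.8 and an error variance of 1. It assumes that PVE is the QTL variance a_i² divided by the phenotypic variance, which is consistent with the percentages quoted above. The variable names are illustrative only.

```python
# Additive effects a_i of Q1..Q4 from table 5.8. The signs differ between
# independent models I and II but do not matter, because the effects are squared.
effects = [0.316, 0.447, 0.548, 0.633]
error_variance = 1.0

genetic_variance = sum(a ** 2 for a in effects)           # equation 5.41
phenotypic_variance = genetic_variance + error_variance
heritability = genetic_variance / phenotypic_variance
# Assumed definition: PVE (%) = 100 * a_i^2 / phenotypic variance.
pve = [100 * a ** 2 / phenotypic_variance for a in effects]

print(round(genetic_variance, 2), round(heritability, 2))
print([round(p, 1) for p in pve])
```

Because the effects in table 5.8 are rounded to three decimals, the computed genetic variance is approximately, not exactly, equal to 1.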

In linkage model I, Q1 and Q2 have positive effects and are located on one
chromosome; Q3 and Q4 also have positive effects but are located on another
chromosome. They represent two pairs of QTLs linked in the coupling phase. In
linkage model II, Q1 and Q2 are linked on one chromosome and have effects in
opposite directions; Q3 and Q4 are linked on another chromosome and also have effects
in opposite directions. They represent two pairs of QTLs linked in the
repulsive phase. Let r and c represent, respectively, the recombination frequency and
map distance (cM) between two linked QTLs. Haldane's mapping function, i.e.,
$r = \frac{1}{2}(1 - e^{-c/50})$, is used to convert the map distance (c) to the recombination
frequency (r), or vice versa. Genetic variance in the two linkage models can be
calculated by equation 5.42.


$$V_G = \sum_{i=1}^{4} a_i^2 + 2(1 - 2r_{12})\,a_1 a_2 + 2(1 - 2r_{34})\,a_3 a_4 \qquad (5.42)$$


where a_i is the additive effect of the ith QTL, r_12 is the recombination frequency
between Q1 and Q2, and r_34 is the recombination frequency between Q3 and Q4.
With 30 cM between the two linked QTLs in each pair, Haldane's mapping function gives
r_12 = r_34 ≈ 0.226. From equation 5.42, we have V_G = 1.536 for linkage model I, and
V_G = 0.465 for linkage model II. Under the assumption that the error variance is equal to 1, the
heritability of the phenotypic trait is equal to 0.606 for linkage model I, but 0.317 for
linkage model II (table 5.8).
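
The calculation for the two linkage models can be sketched as follows. The code assumes Haldane's mapping function with the distance in cM and the variance formula of equation 5.42 for two pairs of linked QTLs in a DH population. The function names are hypothetical, and the effects and positions are those of table 5.8.

```python
import math


def haldane_r(c_cm):
    """Recombination frequency for a map distance in cM (Haldane's function)."""
    return 0.5 * (1.0 - math.exp(-c_cm / 50.0))


def linkage_genetic_variance(effects, linked_pairs, c_cm):
    """Sum of a_i^2 plus 2 * (1 - 2r) * a_i * a_j for each linked pair (i, j)."""
    r = haldane_r(c_cm)
    variance = sum(a ** 2 for a in effects)
    for i, j in linked_pairs:
        variance += 2.0 * (1.0 - 2.0 * r) * effects[i] * effects[j]
    return variance


pairs = [(0, 1), (2, 3)]                      # Q1-Q2 and Q3-Q4, each 30 cM apart
coupling = [0.316, 0.447, 0.548, 0.633]       # linkage model I
repulsion = [0.316, -0.447, 0.548, -0.633]    # linkage model II

for name, effects in [("I", coupling), ("II", repulsion)]:
    vg = linkage_genetic_variance(effects, pairs, c_cm=30.0)
    h2 = vg / (vg + 1.0)                      # error variance assumed equal to 1
    print(f"linkage model {name}: V_G = {vg:.3f}, heritability = {h2:.3f}")
```

The covariance term is positive in coupling and negative in repulsion, which is why the same four effects give a larger genetic variance in linkage model I and a smaller one in linkage model II.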

To repeat once again, genetic variance and heritability are parameters that are
defined and calculated for a specific genetic population. It does not make much
sense to talk about genetic variance, heritability, or genetic covariance and
correlation without referring to the population. The same set of QTLs with constant
genetic effects may produce different genetic variances and heritabilities in
different genetic populations and under different linkage relationships, as shown by
comparing independent model I with linkage model I, or independent model II with
linkage model II, in table 5.8.

TAB. 5.8 - QTL positions and effects in four genetic models used in power simulation.

QTL and related parameters   Chromosome   Position (cM)   Additive effect   PVE (%)

Independent model I
Q1                           1            35               0.316             5.0
Q2                           2            35               0.447            10.0
Q3                           3            35               0.548            15.0
Q4                           4            35               0.633            20.0
Genetic variance             1.000
Error variance               1.000
Heritability                 0.500

Independent model II
Q1                           1            35               0.316             5.0
Q2                           2            35              -0.447            10.0
Q3                           3            35               0.548            15.0
Q4                           4            35              -0.633            20.0
Genetic variance             1.000
Error variance               1.000
Heritability                 0.500

Linkage model I
Q1                           1            35               0.316             3.9
Q2                           1            65               0.447             7.9
Q3                           2            35               0.548            11.8
Q4                           2            65               0.633            15.8
Genetic variance             1.536
Error variance               1.000
Heritability                 0.606

Linkage model II
Q1                           1            35               0.316             6.8
Q2                           1            65              -0.447            13.7
Q3                           2            35               0.548            20.5
Q4                           2            65              -0.633            27.3
Genetic variance             0.465
Error variance               1.000
Heritability                 0.317


5.4.4 Calculation of the Detection Power and False
Discovery Rate in QTL Mapping


As with any other statistical hypothesis test, two types of error also occur in QTL
mapping. Firstly, a QTL may be wrongly declared to be present at a
chromosomal position where no QTL is actually located, i.e., a false positive
occurs, which is a type I error. Secondly, a QTL may be wrongly
declared absent at a chromosomal position where a QTL is actually present, i.e., a
false negative occurs, which is a type II error.

The probability of type I error can be controlled by choosing an appropriate
LOD score threshold (see §4.3, chapter 4). The probability of type II error is
determined by the size of the mapping population and the size of the QTL effects.
Statistical power is equal to the probability of rejecting the null hypothesis when the
alternative hypothesis is actually true. In QTL mapping, detection power is
the probability that a true QTL is detected, and it is one of the most important
indicators used to evaluate and compare the efficiencies of different
mapping methods (Li et al., 2010). In one genetic population, there are usually a
number of QTLs controlling a phenotypic trait of interest, located at
different positions in the genome. Genome size also varies greatly from one species to
another, with the number of chromosomes ranging from just a few to dozens, and
the length of a single chromosome ranging from tens to hundreds of cM. Genes that
control the trait are usually located on a limited number of chromosomal segments,
and most regions in the genome do not carry any QTL.

To better compare different mapping methods, the concept of false discovery rate
(FDR) was proposed. In simulation studies, FDR is defined as the proportion of false
positive QTLs among all detected QTLs (Li et al., 2007, 2012; Zhang et al.,
2008; Benjamini and Hochberg, 1995). For example, if in one genetic population the
IM method detects five QTLs for maturity in rice, of which two are false
positives, the FDR from IM is equal to 0.4. If, in the same population, the ICIM method
detects 10 QTLs for maturity, of which 3 are false positives, the FDR from ICIM is equal
to 0.3. By definition, the FDR and detection power can be calculated only when the
detected QTLs can be classified as either true or false positives. In
practical populations, this is generally difficult. However, in populations that
are simulated from a pre-defined genetic model, the positions and effects of the true QTLs
are known in advance. By comparing the QTLs detected in the simulated population
with the pre-defined QTLs, it can be determined which detected QTLs are true and which
are false, and thus the detection power of each pre-defined QTL and the FDR can be
calculated. For this reason, FDR and detection power are investigated only in
theoretical studies on QTL mapping methodologies. Obviously, the best mapping method
is expected to have an FDR as low as possible and a detection power
as high as possible.

From the genetic models defined in table 5.8, a large number of DH populations
can be generated by computer simulation. In each simulated population, the marker
genotypes and QTL genotypes are known for every DH line in the population.
Therefore, the genotypic values of the simulated DH lines can be calculated from
equation 5.7. Error effects can be simulated from the normal distribution with a
mean of 0 and a variance equal to the pre-defined error variance. Phenotypic values
of the simulated DH lines are obtained by adding the error effects to the genotypic
values. At this point, the simulated population has both marker genotypes and
phenotypic observations, so it is no different from an actual mapping
population. QTL mapping is conducted in simulated populations just as in any actual
mapping population, and the detection powers are then calculated.
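
The simulation steps above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the text: markers are placed at 0, 10, ..., 120 cM; crossovers follow Haldane's no-interference model; genotypes are coded +1 and -1 for the two parents; and the genotypic value is the sum of a_i times the genotype code, with the population mean set to 0 (equation 5.7 is not reproduced here). All function names are hypothetical.

```python
import math
import random


def haldane_r(c_cm):
    """Recombination frequency for a map distance in cM (Haldane's function)."""
    return 0.5 * (1.0 - math.exp(-c_cm / 50.0))


def simulate_chromosome(positions, rng):
    """Parental origin (+1 or -1) at ordered positions (cM) for one DH line."""
    genotype = [rng.choice((1, -1))]
    for left, right in zip(positions, positions[1:]):
        crossover = rng.random() < haldane_r(right - left)
        genotype.append(-genotype[-1] if crossover else genotype[-1])
    return genotype


def simulate_population(qtl, n_lines=200, n_chrom=6, error_var=1.0, seed=1):
    """qtl maps a 0-based chromosome index to a list of (position_cM, effect).

    Returns one (marker_genotypes, genotypic_value, phenotypic_value) per line.
    """
    rng = random.Random(seed)
    markers = [10.0 * k for k in range(13)]     # assumed map: 0, 10, ..., 120 cM
    population = []
    for _ in range(n_lines):
        marker_genotypes, genotypic_value = [], 0.0
        for chrom in range(n_chrom):
            # Merge markers (effect None) and QTLs, ordered along the chromosome.
            loci = sorted([(m, None) for m in markers] + list(qtl.get(chrom, [])),
                          key=lambda locus: locus[0])
            states = simulate_chromosome([pos for pos, _ in loci], rng)
            for (_, effect), x in zip(loci, states):
                if effect is None:
                    marker_genotypes.append(x)        # observed marker genotype
                else:
                    genotypic_value += effect * x     # assumed additive model
        error = rng.gauss(0.0, math.sqrt(error_var))
        population.append((marker_genotypes, genotypic_value, genotypic_value + error))
    return population


independent_model_1 = {0: [(35.0, 0.316)], 1: [(35.0, 0.447)],
                       2: [(35.0, 0.548)], 3: [(35.0, 0.633)]}
population = simulate_population(independent_model_1)
print(len(population), len(population[0][0]))   # number of lines, markers per line
```

Only the marker genotypes and phenotypic values would be passed to a mapping method; the QTL genotypes stay hidden, as in an actual population.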

The LOD score is the test statistic used in interval-based mapping methods, such
as IM and ICIM. QTLs are located at the significant peaks in the LOD score profile
obtained by step-wise, one-dimensional scanning along all chromosomes in
the genome. Hypothesis testing is conducted at every scanning position, and
therefore the mapping procedure involves a large number of tests, which are not
mutually independent because of linkage. Even when the positions and
effects of the pre-defined QTLs are known, counting true and false positives among the
detected QTLs is not straightforward. When QTLs are closely linked, it becomes even more
difficult to determine which detected QTL corresponds to which putative QTL. The
following two approaches can be adopted to calculate the detection power in QTL
mapping. The first is based on marker intervals in the genome. Assume one QTL is
pre-defined in one marker interval. In one simulated population, if a QTL is
detected in that marker interval, the putative QTL is declared to be correctly
identified; otherwise, if no QTL is detected in the marker interval, the putative QTL
is not found, i.e., a false negative occurs. This approach helps to track the distribution
of true and false positives across the genome. The second is based on a support interval
(SI) for each pre-defined QTL. The SI has a given length, with the putative QTL
located in the middle. Power is determined by counting the simulated
populations where a QTL is detected in the given SI. QTLs detected outside
the support intervals of all putative QTLs are regarded as false positives. From each
of the four genetic models in table 5.8, five DH populations, each with a size of 200,
are simulated and used to illustrate the calculation of detection power and FDR in
QTL mapping (figures 5.8 and 5.9).

When the four QTLs are located on different chromosomes (figure 5.8), IM and
ICIM both produce distinct peaks on the chromosomes with putative QTLs. The
absolute value of the QTL effect determines the height of the peak: the larger the
absolute value of the effect, the higher the peak at the detected QTL, whereas the height
of the peak has nothing to do with the direction of the QTL effect. For linkage model
I, although peaks are still observed on chromosomes with QTLs, the two linked
QTLs can hardly be separated on the LOD profile from IM (figure 5.9). For
linkage model II, significant peaks can hardly be seen on the LOD profile from IM,
let alone the detection of two linked QTLs. When ICIM is used, four significant
peaks are identified in population 1 from linkage model I, and in populations 2 and 3
from linkage model II. The number of significant peaks is often lower than the number of
putative QTLs in the simulated populations (figure 5.9), which indicates that linkage
between QTLs greatly reduces the power of QTL detection.

For the five simulated populations from independent model I (figure 5.8,
left), the chromosomal positions of the detected QTLs, LOD scores, PVEs, and additive
effects at the peak positions are given in table 5.9. In the first simulated population,
three QTLs were detected by IM. The first one is located at 25 cM on the second
chromosome, where Q2 is defined, but the detected position is
10 cM away from the putative QTL. If only a QTL detected within 5 cM to the
left or right of Q2 is considered a true positive (i.e., the SI is 10 cM long), the
detected QTL is judged to be a false positive (table 5.9). However, if the length of
the SI is expanded to 20 cM (i.e., a QTL detected within 10 cM to the left or right of
Q2 counts as a correct detection of the putative QTL), the detected QTL is judged to
be a true positive (table 5.9). The second one is located at 35 cM on chromosome 3, which
coincides with the position of Q3. This QTL is within both the 10 cM and 20 cM
support intervals of Q3. The third one is located at 40 cM on the fourth
chromosome, 5 cM from Q4, so this QTL is considered to have been
detected within both support intervals of Q4 (table 5.9).

FIG. 5.8 - LOD score profiles of the one-dimensional scanning from IM and ICIM in five
simulated DH populations from two independent genetic models (QTL positions and effects
are shown in table 5.8).

In the first simulated population, ICIM detected four QTLs. The first one is
located at 47 cM on the first chromosome, where putative Q1 is defined. However, the
estimated position is 12 cM away from the position of Q1, outside the support intervals
of both 10 cM and 20 cM. Therefore, the first detected QTL is regarded as a
false positive (table 5.9). The second QTL is located at 38 cM on the second
chromosome, 3 cM from putative Q2. The detected QTL is within
both SIs (10 cM and 20 cM) of Q2, i.e., putative Q2 is detected in the simulated
population with either SI. The third QTL is located at 33 cM on the third
chromosome, 2 cM from putative Q3. Therefore, putative Q3 is detected
in the simulated population with either SI. The fourth QTL is located at 38 cM on
chromosome 4, 3 cM from putative Q4. Therefore, putative Q4 is
also detected in the simulated population. Based on the criteria described above,
each QTL detected in each simulated population can be counted as either a true positive
for a pre-defined QTL or a false positive, as shown in table 5.9. Finally, the number of
simulated populations where each pre-defined QTL is correctly detected and the
total number of false positive QTLs across the simulated populations can be counted. The
results for the five simulated populations are shown in table 5.10.

FIG. 5.9 - LOD score profiles of one-dimensional scanning from IM and ICIM in five
simulated DH populations from two linked-QTL genetic models (QTL positions and effects are
shown in table 5.8).

Using the 10-cM SI, the four pre-defined QTLs in independent model I were
detected by IM in one, three, four, and five of the five simulated populations,
respectively, with three detected QTLs being false positives (table 5.10). As far as the
five simulated populations are concerned, the detection powers of the four putative
QTLs are equal to 20%, 60%, 80%, and 100%, respectively. FDR is defined as
the proportion of false positives among all detected positives, which in this case is equal
to 3/(1 + 3 + 4 + 5 + 3) = 18.75% (table 5.10). Using the same SI and the same
five simulated populations, ICIM detected the four putative QTLs in four, four, five,
and four populations, respectively, with three detected QTLs being false positives.
Therefore, the detection powers for Q1, Q2, Q3, and Q4 are equal to 80%, 80%, 100%,
and 80%, respectively, and FDR = 3/(4 + 4 + 5 + 4 + 3) = 15% (table 5.10). When
the length of the SI is increased from 10 cM to 20 cM, the detection power increases or
stays the same, and the FDR decreases, for both methods (table 5.10).
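
The counting rule can be written as a short script. The sketch below takes the chromosome and position of every QTL reported in table 5.9 and classifies each one against the pre-defined QTLs of independent model I (35 cM on chromosomes 1 to 4). The function name `power_and_fdr` is hypothetical, and the FDR denominator follows the calculation shown in the text: true detections plus false positives.

```python
TRUE_QTL = {1: 35, 2: 35, 3: 35, 4: 35}   # chromosome -> true position (cM)

# (chromosome, position in cM) of detected QTLs per simulated population (table 5.9).
IM = [
    [(2, 25), (3, 35), (4, 40)],
    [(2, 34), (3, 34), (4, 30)],
    [(1, 39), (2, 32), (3, 54), (4, 36)],
    [(1, 45), (2, 36), (3, 34), (4, 36)],
    [(3, 33), (4, 35)],
]
ICIM = [
    [(1, 47), (2, 38), (3, 33), (4, 38)],
    [(1, 35), (2, 36), (3, 31), (4, 27)],
    [(1, 36), (2, 32), (3, 38), (4, 38)],
    [(1, 30), (2, 37), (3, 33), (4, 36)],
    [(1, 35), (2, 51), (3, 34), (4, 35)],
]


def power_and_fdr(populations, si_length):
    """Return (power in % for Q1..Q4, FDR in %) for a given SI length in cM."""
    half = si_length / 2
    hits = {chrom: 0 for chrom in TRUE_QTL}
    false_positives = 0
    for detected in populations:
        found = set()                     # each QTL counts once per population
        for chrom, pos in detected:
            if chrom in TRUE_QTL and abs(pos - TRUE_QTL[chrom]) <= half:
                found.add(chrom)
            else:
                false_positives += 1
        for chrom in found:
            hits[chrom] += 1
    power = [100 * hits[c] / len(populations) for c in sorted(hits)]
    fdr = 100 * false_positives / (sum(hits.values()) + false_positives)
    return power, fdr


for name, data in [("IM", IM), ("ICIM", ICIM)]:
    for si in (10, 20):
        print(name, si, power_and_fdr(data, si))
```

With only five populations the percentages move in steps of 20, so they illustrate the counting rule and are not reliable estimates of power.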

TAB. 5.9 - QTLs detected in five simulated DH populations for independent genetic model I.

Method  Simulated DH  QTLs detected                                    Support interval (SI)
        population    Chromosome  Location  LOD     PVE     Effect     10 cM    20 cM
                                  (cM)      score   (%)

IM      1             2           25        4.97    11.44   0.503      False    Q2
                      3           35        5.61    13.35   0.541      Q3       Q3
                      4           40        13.21   26.22   0.761      Q4       Q4
        2             2           34        5.36    13.01   0.509      Q2       Q2
                      3           34        5.82    13.72   0.521      Q3       Q3
                      4           30        11.59   23.43   0.682      Q4       Q4
        3             1           39        5.05    11.22   0.508      Q1       Q1
                      2           32        4.30    10.09   0.482      Q2       Q2
                      3           54        8.03    18.42   0.651      False    False
                      4           36        8.06    18.55   0.653      Q4       Q4
        4             1           45        3.97    10.21   0.420      False    Q1
                      2           36        2.69    6.81    0.343      Q2       Q2
                      3           34        8.92    19.66   0.583      Q3       Q3
                      4           36        8.79    20.15   0.591      Q4       Q4
        5             3           33        3.08    8.16    0.389      Q3       Q3
                      4           35        11.71   26.65   0.701      Q4       Q4
ICIM    1             1           47        3.80    5.06    0.335      False    False
                      2           38        6.79    9.11    0.448      Q2       Q2
                      3           33        9.70    13.81   0.551      Q3       Q3
                      4           38        16.72   25.50   0.753      Q4       Q4
        2             1           35        4.65    6.26    0.352      Q1       Q1
                      2           36        9.07    12.56   0.500      Q2       Q2
                      3           31        7.93    10.41   0.454      Q3       Q3
                      4           27        16.77   24.93   0.703      False    Q4
        3             1           36        7.52    10.23   0.486      Q1       Q1
                      2           32        6.00    8.10    0.432      Q2       Q2
                      3           38        9.52    13.63   0.560      Q3       Q3
                      4           38        13.05   19.18   0.664      Q4       Q4
        4             1           30        3.99    5.13    0.298      Q1       Q1
                      2           37        4.04    5.89    0.319      Q2       Q2
                      3           33        14.21   21.68   0.613      Q3       Q3
                      4           36        13.73   21.23   0.607      Q4       Q4
        5             1           35        4.91    8.04    0.384      Q1       Q1
                      2           51        4.35    6.87    0.356      False    False
                      3           34        5.35    9.45    0.419      Q3       Q3
                      4           35        17.46   31.65   0.764      Q4       Q4

TAB. 5.10 - QTL detection power and false discovery rate (FDR) of IM and ICIM estimated
in five simulated DH populations for independent model I defined in table 5.8.

Method  QTL                       Number of populations where    QTL detection power (%)
                                  the QTL is detected, and       and FDR (%), for two
                                  total false positives, for     lengths of support
                                  two lengths of support         interval
                                  interval
                                  10 cM      20 cM               10 cM      20 cM

IM      Q1                        1          2                   20.0       40.0
        Q2                        3          4                   60.0       80.0
        Q3                        4          4                   80.0       80.0
        Q4                        5          5                   100.0      100.0
        Number of false           3          1                   18.75      6.25
        positives and FDR

ICIM    Q1                        4          4                   80.0       80.0
        Q2                        4          4                   80.0       80.0
        Q3                        5          5                   100.0      100.0
        Q4                        4          5                   80.0       100.0
        Number of false           3          2                   15.00      10.00
        positives and FDR


5.5 Comparison of IM and ICIM by Simulation 


In the previous section, the definition and calculation of statistical power in
hypothesis testing were introduced, together with the use of simulation to
calculate the detection power and FDR in QTL mapping, which can
hardly be derived theoretically. This section focuses on the comparison of ICIM
with IM, using the four genetic models defined in table 5.8. For comparisons
with other mapping methods using other genetic models, readers can refer to
Li et al. (2007, 2012) and Zhang et al. (2008).


5.5.1 QTL Detection Power and FDR from IM 


Detection powers of putative QTLs and false discovery rates (FDR) from IM are
shown in table 5.11, calculated from a total of 1000 simulated DH
populations, each of size 200. The four genetic models, which are the same as those
defined in table 5.8, are used to define the genotypic and phenotypic values
of DH lines in each simulated population. The length of the support interval (SI) is
set at 10 cM; that is to say, if a QTL is detected within 5 cM to the
left or right of a pre-defined QTL, the pre-defined
QTL is declared to be correctly identified in the mapping population. As can be seen from
table 5.11, the four putative QTLs were detected with similar powers in mapping
populations simulated from the two independent models. For independent model I,
detection powers are equal to 25.8%, 62.6%, 77.7%, and 85.4%, respectively, while
for independent model II, detection powers are equal to 27.3%, 64.2%, 78.6%, and
84.6%, respectively. False discovery rates from the two independent models are
similar as well, equal to 32.4% and 31.1%, respectively. The results
indicate that in the case of independent inheritance, detection power depends on the
genetic variance of the QTL, regardless of the direction of its additive effect.
The greater the genetic variance of the QTL, the greater its PVE, and
the higher the detection power.

TAB. 5.11 - QTL detection power and false discovery rate (FDR) from IM.

QTL and    Power   Position   Standard      LOD      Standard    Additive   Standard
FDR        (%)     (cM)       deviation     score    deviation   effect     deviation
                              of position            of LOD                 of effect
                                                     score

Independent model I
Q1         25.8    35.182     3.461         3.849    1.003        0.421     0.056
Q2         62.6    34.861     3.014         5.084    1.692        0.480     0.082
Q3         77.7    35.006     2.669         7.013    2.178        0.560     0.092
Q4         85.4    35.067     2.464         9.205    2.584        0.635     0.095
FDR (%)    32.4

Independent model II
Q1         27.3    34.835     3.414         3.865    1.151        0.424     0.062
Q2         64.2    35.062     3.035         5.006    1.607       -0.478     0.078
Q3         78.6    34.956     2.706         6.963    2.143        0.558     0.090
Q4         84.6    34.865     2.481         9.374    2.441       -0.640     0.089
FDR (%)    31.1

Linkage model I
Q1         24.1    35.448     2.757         6.944    2.092        0.625     0.099
Q2         49.0    64.790     2.549         7.859    2.278        0.665     0.104
Q3         40.2    35.759     1.648         17.017   3.105        0.937     0.090
Q4         56.5    64.165     1.624         18.682   3.359        0.971     0.093
FDR (%)    53.1

Linkage model II
Q1         0.3     33.333     4.714         2.965    0.431        0.313     0.013
Q2         25.3    66.534     3.220         3.667    1.029       -0.354     0.050
Q3         6.8     31.691     2.608         3.176    0.553        0.334     0.031
Q4         40.6    67.746     2.680         4.169    1.302       -0.373     0.061
FDR (%)    38.9

Note: A total of 1000 DH populations are simulated, each of size 200; the length of the
support interval is 10 cM.

In linkage model I, Q1 and Q2 are located on the same chromosome, both with
additive effects greater than 0, and they are linked in the coupling phase. Detection
powers of Q1 and Q2 from IM are equal to 24.1% and 49.0%, respectively. Q3 and Q4
are located on another chromosome and are also linked in the coupling phase.
Detection powers of Q3 and Q4 from IM are equal to 40.2% and 56.5%, respectively.
The false discovery rate from linkage model I is as high as 53.1%, indicating that
more than half of the detected QTLs are not within the support intervals of the four
putative QTLs. In linkage model II, Q1 and Q2 are linked on one chromosome in the
repulsive phase. Detection powers are equal to 0.3% and 25.3% for Q1 and Q2,
respectively. Q3 and Q4 are linked on another chromosome, also in the repulsive
phase. Detection powers are equal to 6.8% and 40.6% for Q3 and Q4, respectively.
Therefore, when two QTLs are linked in the repulsive phase, the one with the smaller
effect is more difficult to detect.

Because of the given LOD score threshold and SI, the estimated position and
effect of the true QTL in the independent genetic models (see table 5.11) are close to
the true position and effect (see table 5.8). The larger the genetic variance of the
QTL, the higher the corresponding LOD score and the smaller the standard
deviation of the estimates. The results show that a QTL with large genetic variance
can not only be detected with high power, but its position and effect can also be
estimated with high accuracy. In the linkage genetic models, the QTL detection power
and the accuracy of position and effect estimation depend not only on the effect of
a single QTL but also on the distance between linked QTLs and the directions
of the QTL effects. Coupling linkage increases the LOD score of IM, but also increases
the bias and variance of the effect estimate, thus reducing the precision and accuracy
of QTL detection.


5.5.2 QTL Detection Power and FDR from ICIM 


Detection powers of putative QTLs and FDR from ICIM are shown in table 5.12,
calculated from 1000 simulated DH populations, each of size 200. The four genetic
models used to define the genotypic and phenotypic data of the simulated populations
are the same as those defined in table 5.8. The length of the support interval (SI) is
set at 10 cM. The four putative QTLs were detected with similar powers in the two
independent models. For independent model I, detection powers are equal to 49.5%,
73.9%, 82.8%, and 89.0%, respectively, higher than those from IM; for independent
model II, detection powers are equal to 49.2%, 76.1%, 85.6%, and 90.0%,
respectively, also higher than those from IM. The false discovery rates from the two
independent models are similar as well, equal to 22.6% and 21.3%,
respectively (table 5.12), and lower than those from IM. In linkage model I, Q1 and Q2
are linked in the coupling phase, and their detection powers are equal to 26.9% and
55.5%, respectively; Q3 and Q4 are also linked in the coupling phase, and their
detection powers are equal to 77.0% and 84.2%, respectively. The false discovery rate is
as high as 26.4%; that is to say, more than one-fourth of the detected QTLs are not
within the SI of any putative QTL. In linkage model II, Q1 and Q2 are linked in the
repulsive phase, and their detection powers are equal to 11.6% and 33.0%,
respectively; Q3 and Q4 are linked in the repulsive phase, and their detection powers
are equal to 56.2% and 60.9%, respectively. The false discovery rate is
23.8%. Comparing the results given in tables 5.11 and 5.12, it is clear that even
in the case of linked QTLs, ICIM still has higher detection powers and lower FDR
than IM.

TAB. 5.12 - QTL detection power and false discovery rate from ICIM.

QTL and    Power   Position   Standard      LOD      Standard    Additive   Standard
FDR        (%)     (cM)       deviation     score    deviation   effect     deviation
                              of position            of LOD                 of effect
                                                     score

Independent model I
Q1         49.5    34.867     3.184         4.667    1.656        0.354     0.062
Q2         73.9    34.874     2.769         7.156    2.295        0.450     0.077
Q3         82.8    34.958     2.521         10.161   2.710        0.548     0.078
Q4         89.0    35.160     2.278         13.087   3.229        0.632     0.083
FDR (%)    22.6

Independent model II
Q1         49.2    34.831     3.204         4.589    1.640        0.352     0.063
Q2         76.1    35.030     2.861         7.142    2.328       -0.448     0.076
Q3         85.6    35.051     2.484         10.193   2.755        0.548     0.081
Q4         90.0    34.939     2.325         13.203   3.221       -0.634     0.082
FDR (%)    21.3

Linkage model I
Q1         26.9    35.353     3.051         7.335    3.466        0.449     0.118
Q2         55.5    64.872     2.701         10.519   4.184        0.558     0.133
Q3         77.0    34.952     2.618         10.560   3.890        0.559     0.113
Q4         84.2    64.828     2.533         13.668   4.761        0.649     0.130
FDR (%)    26.4

Linkage model II
Q1         11.6    34.216     3.615         5.100    1.915        0.370     0.066
Q2         33.0    66.179     3.053         5.872    3.188       -0.402     0.108
Q3         56.2    34.383     2.894         8.332    3.381        0.492     0.104
Q4         60.9    65.984     2.429         11.413   4.131       -0.591     0.114
FDR (%)    23.8

Note: A total of 1000 DH populations are simulated, each of size 200; the length of the
support interval is 10 cM.

Due to the use of the threshold LOD score and the fixed-length support
interval, the estimated positions and genetic effects of putative QTLs from the independent
models (table 5.12) are close to their true values given in table 5.8. The larger the
QTL genetic variance, the higher the corresponding LOD score, and the smaller the
standard deviation of the estimates (table 5.12). Therefore, a QTL with larger
genetic variance not only has higher detection power but also has higher accuracy in
the estimation of chromosomal position and genetic effect. When QTLs are linked,
detection power and estimation accuracy depend not only on the genetic effects of
the linked QTLs but also on the genetic distance between the linked QTLs and the
directions of the QTL effects. Although linkage reduces the detection powers of
ICIM, the powers in detecting Q2, Q3, and Q4 in linkage model I of the coupling phase
are still above 50%, and the powers in detecting Q3 and Q4 in linkage model II of the
repulsive phase are still above 50%.

In independent model I and linkage model I, the additive effects of the four
putative QTLs are the same, but the linkage relationship
between QTLs is different. It can be seen from table 5.12 that the LOD scores
are fairly similar between the two models. LOD scores in table 5.12 are also
more similar between independent model II and linkage model II than
those from IM given in table 5.11. This is exactly what background control is
expected to achieve. Through background control, ICIM eliminates the
influence of genetic variation outside the current scanning interval. The test statistic at
any scanning position depends only on the genetic variation of QTLs located in the
current scanning interval. Thus, the power of QTL detection and the
accuracy of the estimated QTL position and effect are improved, and false positive QTLs are
reduced.


5.5.3 Detection Powers Counted by Marker Intervals


Detection powers and false discovery rates shown in tables 5.11 and 5.12 were
counted by support intervals corresponding to pre-defined QTLs. Alternatively,
detection power can be calculated by marker interval on each chromosome.
Interval-based QTL mapping is conducted with a known genetic linkage map, so in
each simulated population it is possible to record whether a QTL is detected in
each marker interval. With this approach, the distribution of true and false
positive QTLs across the whole genome can be monitored. Figure 5.10 shows the
detection powers from IM and ICIM, calculated by marker intervals on
chromosomes 1 to 4, for the four genetic models defined in table 5.8.
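
Counting by marker interval can be sketched in a few lines. The example below assumes 13 evenly spaced markers per 120-cM chromosome, i.e., 12 intervals of 10 cM, and assigns a position that falls exactly on a marker to the interval on its right, which is an arbitrary convention chosen here. For illustration it reuses the positions detected by IM in the five simulated populations of table 5.9. The function names are hypothetical.

```python
from collections import Counter

MARKER_SPACING = 10.0      # cM between adjacent markers (assumed map)
N_INTERVALS = 12           # marker intervals per chromosome


def interval_index(position_cm):
    """0-based index of the marker interval containing a scanning position."""
    return min(int(position_cm // MARKER_SPACING), N_INTERVALS - 1)


def power_by_interval(populations):
    """Percentage of populations with at least one QTL detected per interval."""
    counts = Counter()
    for detected in populations:
        # A set ensures that an interval is counted once per population.
        counts.update({(chrom, interval_index(pos)) for chrom, pos in detected})
    return {key: 100 * n / len(populations) for key, n in counts.items()}


# (chromosome, position in cM) of QTLs detected by IM (table 5.9).
IM = [
    [(2, 25), (3, 35), (4, 40)],
    [(2, 34), (3, 34), (4, 30)],
    [(1, 39), (2, 32), (3, 54), (4, 36)],
    [(1, 45), (2, 36), (3, 34), (4, 36)],
    [(3, 33), (4, 35)],
]

for (chrom, idx), power in sorted(power_by_interval(IM).items()):
    left = idx * MARKER_SPACING
    print(f"chromosome {chrom}, interval {left:.0f}-{left + MARKER_SPACING:.0f} cM: {power:.0f}%")
```

Intervals that contain a pre-defined QTL give detection power, while counts in all other intervals show where false positives accumulate.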

When there is no linkage between putative QTLs, intervals where the putative 
QTLs are located have the highest detection powers on the corresponding chro- 
mosomes, regardless of whether IM or ICIM is used. Random errors in QTL map- 
ping come from two sources. One is the deviation of phenotypic value from true 
genotypic value, which is caused by random factors in the phenotyping experiment 
and phenotypic measurement. The other one is the deviation of allelic and genotypic 
frequencies from their expected values, which is caused by the randomness of 
crossing-over and recombination events between two homologous chromosomes, and 
the random combination of female and male gametes during sexual propagation. 
Errors related to phenotypic data can be controlled by suitable experimental designs 
and accurate measurement methods. Random errors associated with allelic and 
genotypic frequencies are generally hard to control, and the major option to reduce 
their effects in QTL mapping is to increase the population size. Therefore, it is 


Inclusive Composite Interval Mapping 235 


[Figure 5.10: paired bar-chart panels for IM (left column) and ICIM (right column), with one row each for Independent model I, Independent model II, Linkage model I, and Linkage model II. Vertical axis: detection power, scale 0-100. Horizontal axis: marker intervals on the first 4 chromosomes.] 


FIG. 5.10 - Detection power counted by marker interval from IM and ICIM. Notes: Detection 
power is obtained from 1000 simulated populations. Marker intervals where the four 
pre-defined QTLs are located are represented by solid bars. 


difficult to completely eliminate the effect of random errors in QTL mapping. Due to 
the influence of random factors, a QTL may be located far away from its true 
position. Both IM and ICIM detect some QTLs in marker intervals near the 
putative QTLs (figure 5.10). But generally speaking, the more distant an interval is 
from the putative QTL, the less likely a QTL is to be detected in it. 

For coupling linkage between QTLs, i.e., linkage model I in table 5.8, it can be 
clearly seen from figure 5.10 that IM detects a large number of false QTLs in the empty 
intervals between two linked QTLs. This is the “ghost” QTL phenomenon 
introduced in §4.2.6, chapter 4. For repulsive linkage between QTLs, i.e., linkage 
model II in table 5.8, IM only detects a small number of QTLs. Linkage also reduces 
the detection power of ICIM. But in comparison with IM, ICIM still has relatively 
high detection powers in those intervals with putative QTLs. In intervals where 
no QTL is located, ICIM detects far fewer QTLs; these can be counted as 
false positives if the marker intervals containing putative QTLs are treated as support 
intervals. 


5.5.4 Suitable Population Size Required in QTL Mapping 


As with statistical tests in any other scientific problem, there are two 
major purposes in investigating false discovery rate and detection power in QTL 
mapping. The first is to quantitatively evaluate and compare different mapping 
methods, and the second is to determine an appropriate size for mapping 
populations. Table 5.13 shows the minimum size of RIL populations required for 
ICIM to detect one QTL at a given marker density and statistical power. 
In table 5.13, two marker densities (5 cM and 10 cM) and two statistical powers 
(0.8 and 0.9) are given; linkage between QTLs is not considered. Results in table 5.13 
were obtained from an extensive simulation study on QTL detection power 
(Li et al., 2010). Two criteria are used to determine whether the putative QTL is 
correctly detected. Firstly, the putative QTL is treated as being correctly detected 
as long as one QTL is detected on the chromosome with the putative QTL. This 
criterion is equivalent to using the entire chromosome as the support interval. 
Secondly, the putative QTL is treated as being correctly detected only when one 
QTL is detected within 5 cM to the left or right of the putative QTL, i.e., the length of 
the support interval is 10 cM. 

To answer the question of suitable population size in QTL mapping, the first 
things to know are the genetic effect of the target QTL and the required accuracy of the 
mapping results. Assume the objective is to locate one major-effect QTL that 
explains more than 10% of the phenotypic variation with a detection power above 0.9 
and within a support interval of 10 cM, and that marker density on the linkage map is 
about 5 cM on average. To achieve this objective, a mapping population with 
at least 140 RILs is needed (table 5.13). If the objective is to identify one QTL that 
explains 3% of the phenotypic variation, a mapping population with at least 380 
RILs is needed to achieve the same power and accuracy using the same linkage map. 
As with QTL detection power, many factors influence the 
determination of suitable population size. Only when the target of QTL detection is 
clearly defined can the suitable size be roughly given by considering the type of 
mapping population, the number of markers on the linkage map, the length of the entire 
genome, and so on. 
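
The two readings from table 5.13 quoted above can be reproduced with a small lookup. The sketch below simply stores the table values; the names `TABLE_5_13`, `COLUMNS`, and `min_ril_size` are hypothetical, and `None` is used here to stand for the table entry ">600".

```python
# Rows of table 5.13: PVE (%) -> eight minimum RIL population sizes.
# None stands for ">600", i.e., no size up to 600 reached the target power.
TABLE_5_13 = {
    1: (300, 560, 540, None, None, None, None, None),
    2: (160, 300, 280, 320, 440, None, 460, None),
    3: (110, 200, 180, 200, 260, 380, 300, 440),
    4: (100, 160, 140, 180, 220, 300, 280, 380),
    5: (80, 140, 120, 140, 180, 300, 220, 320),
    10: (50, 80, 70, 80, 100, 140, 120, 140),
    20: (40, 60, 50, 60, 80, 120, 100, 120),
    30: (40, 40, 40, 40, 60, 100, 80, 100),
}

# Column order of the table: criterion, then marker density (cM), then power.
COLUMNS = [
    (criterion, density, power)
    for criterion in ("chromosome", "interval")
    for density in (5, 10)
    for power in (0.8, 0.9)
]

def min_ril_size(pve, criterion, density_cm, power):
    """Look up the minimum RIL population size for one cell of table 5.13."""
    column = COLUMNS.index((criterion, density_cm, power))
    return TABLE_5_13[pve][column]

major = min_ril_size(10, "interval", 5, 0.9)   # QTL explaining 10% of variation
minor = min_ril_size(3, "interval", 5, 0.9)    # QTL explaining 3% of variation
print(major, minor)
```

Only the PVE levels, marker densities, and powers tabulated in table 5.13 can be looked up this way; intermediate values would need additional simulation.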


TAB. 5.13 - Minimum size of bi-parental RIL populations required for the ICIM mapping 
method. 


PVE (%)    Detected on the designated chromosome       Detected within the 10 cM support interval
of QTL     Marker density 5 cM   Marker density 10 cM  Marker density 5 cM   Marker density 10 cM
           Power 0.8  Power 0.9  Power 0.8  Power 0.9  Power 0.8  Power 0.9  Power 0.8  Power 0.9

1          300        560        540        >600       >600       >600       >600       >600
2          160        300        280        320        440        >600       460        >600
3          110        200        180        200        260        380        300        440
4          100        160        140        180        220        300        280        380
5          80         140        120        140        180        300        220        320
10         50         80         70         80         100        140        120        140
20         40         60         50         60         80         120        100        120
30         40         40         40         40         60         100        80         100


Note: linkage was not considered. 


5.6 Avoiding the Overfitting Problem in the First Step 
of Model Selection in ICIM 


ICIM consists of two steps. In the first step of model selection, important marker 
variables are selected, and their effects are estimated by stepwise regression using the 
information from all markers. In the second step of interval mapping, phenotypic 
values are adjusted by the regression model established in the first 
step and then used to test the presence of a QTL at a given chromosomal position. 
ICIM represents an important step forward in QTL mapping that highlights the 
importance of model selection and interval-based testing. While conducting the 
interval test in ICIM, genetic variations in marker intervals other than the current 
one are controlled, resulting in a much higher LOD score than IM in the chromosomal 
regions with QTLs, but a much lower LOD score where no QTLs are located. 
QTL detection power and FDR are calculated from the LOD score profiles; therefore, 
ICIM shows much higher detection power but lower FDR than IM, as 
seen in the previous two sections. This mapping strategy also simplifies the process 
of controlling the background genetic variation in composite interval mapping 
(CIM) (Zeng, 1994). It has been demonstrated that ICIM has advantages over some 
Bayesian models as well (Zhang et al., 2008; Li et al., 2007, 2008; Yi et al., 2003). In 
addition, ICIM is relatively robust to mapping parameters and is more easily 
extended to epistatic QTL mapping and the analysis of QTL by environment 
interactions (Li et al., 2007, 2010, 2012; Wang, 2009; Zhang et al., 2008). In epistatic 
QTL mapping, interactions can be detected not only between QTLs showing 
significant effects per locus, but also between QTLs without 
marginal effects per locus. See chapter 6 for details on the two-dimensional scanning 
of epistatic QTLs, and the stability analysis of QTLs across environments. 

However, the selection of the best-fitting regression model and the accurate 
estimation of marker effects in the first step of ICIM are important to the 
second step of interval mapping. An ideal regression model of the phenotype on 
marker variables should meet the following two goals. Firstly, for each marker 
interval with a QTL, marker variables on both sides of the interval should be included 
in the regression model (see equation 5.10 or equation 5.28). Secondly, if no QTLs 
are present in the neighboring intervals on the two sides of one marker, this marker should 
not be included in the regression model. For DH populations, the ideal regression 
model should completely explain all genetic variance caused by the additive effects of 
QTLs; for F₂ populations, the ideal regression model should completely explain all 
genetic variance caused by both additive and dominant effects of QTLs. While 
scanning at a given marker interval, the influence of QTLs outside the interval is 
eliminated by the adjustment given by equation 5.11 or equation 5.29, without 
losing any information about the QTL in the current scanning interval. 
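
As a minimal sketch of the background control described above for a DH population, the following code subtracts the fitted effects of all markers except the two flanking the current interval. It assumes the adjustment takes the form Δy_i = y_i - Σ b_k x_ik over k ≠ j, j + 1, which is how the text describes it; the function name `adjust_phenotype`, the 0-based marker indices, and the numbers are hypothetical. Markers not selected by stepwise regression are given a coefficient of zero.

```python
def adjust_phenotype(y, x, coef, j):
    """Adjust phenotypes for scanning the interval flanked by markers j and j + 1.

    y    : phenotypic values, one per line
    x    : marker codes (1 or -1), one list per line
    coef : regression coefficient of each marker (0.0 if not selected)
    j    : 0-based index of the left flanking marker
    """
    flanking = {j, j + 1}
    adjusted = []
    for y_i, x_i in zip(y, x):
        # effects of all markers outside the current interval (the background)
        background = sum(
            b * x_ik
            for k, (b, x_ik) in enumerate(zip(coef, x_i))
            if k not in flanking
        )
        adjusted.append(y_i - background)
    return adjusted

# Hypothetical data: two DH lines, four markers, markers 1 and 3 selected.
y = [10.0, 12.0]
x = [[1, 1, -1, -1], [-1, -1, 1, 1]]
coef = [0.0, 2.0, 0.0, 1.5]
delta_y = adjust_phenotype(y, x, coef, 0)   # scanning interval (0, 1)
print(delta_y)
```

The effect attached to marker 1 stays in the adjusted values because that marker flanks the current interval, whereas the effect attached to marker 3 is removed.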

For the regression models given in equations 5.10 and 5.28, the 
number of markers often exceeds the size of the mapping population, especially now 
that SNP arrays are increasingly used to genotype 
mapping populations. Therefore, it is not easy to determine the ideal 
regression model as defined in equation 5.10 or equation 5.28, or to accurately 
estimate the marker effects included in the model. When the regression model 
cannot explain all or most of the genetic variation, the background control in the second 
step of ICIM is incomplete. As a consequence, the detection power is 
reduced, QTLs with smaller effects may not be detected, and more false negative 
errors may occur. When the regression model is over-fitted, i.e., some of the random 
errors are included in the model, more false positive QTLs may appear in the 
second step of interval mapping. Therefore, both under-fitting and over-fitting 
should be avoided in order to use ICIM properly in QTL mapping. By doing so, both 
false negative and false positive errors can be reduced. 

In practical mapping populations, the goodness of fit of the regression 
model can be roughly assessed by comparing the coefficient of determination (R²) 
with the broad-sense heritability of the phenotypic trait of interest. The coefficient 
of determination is defined as the ratio of the regression sum of squares to 
the total sum of squares, which can be regarded as the proportion of phenotypic 
variation explained by the regression model, or the PVE of the model. Taking the barley 
DH population as an example, the broad-sense heritability of kernel weight is equal 
to 0.71, estimated from replicated observations across multiple environments. If 
genetic variation in kernel weight is dominated by the additive effects of QTLs, the R² 
of the ideal regression model should be similar to the broad-sense heritability of the 
trait. Table 5.14 shows the values of R² from four levels of the probability (PIN) for 
variables to enter the regression model. The probability for variables to leave 
the regression model is set at twice PIN. When PIN = 0.001, R² is equal to 
0.7289, which slightly exceeds the heritability of kernel weight (i.e., 0.71). Table 5.3 
shows the detected QTLs for kernel weight, using PIN = 0.001 to choose the 
regression model. If a higher PIN is used, the obtained regression model has a 
higher R² as well (table 5.4). Values of R² from the other three levels of PIN are 
much higher than the heritability of kernel weight, indicating that the obtained regression 
models have an over-fitting problem, i.e., some random errors have been fitted 
in the model. As shown in figure 5.4, the LOD score profiles from three levels (i.e., 
0.001, 0.01, and 0.05) of PIN have many similarities, indicating the robustness of 
ICIM to mapping parameters and to the overfitting issue (figures 5.3 and 5.4). 
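
The comparison between R² and heritability can be written as a simple check. The R² values and the heritability below are those reported for the barley DH population; the tolerance of 0.05 is purely an assumption made for this sketch, since the text only says "much higher" and gives no numerical threshold.

```python
HERITABILITY = 0.71          # broad-sense heritability of kernel weight
R2_BY_PIN = {                # values of table 5.14
    0.001: 0.7289,
    0.01: 0.7963,
    0.05: 0.8131,
    0.1: 0.8886,
}
TOLERANCE = 0.05             # hypothetical margin for "much higher"

def looks_overfitted(r2, heritability, tolerance):
    """Flag a regression model whose R2 exceeds heritability by more than tolerance."""
    return r2 - heritability > tolerance

flags = {
    pin: looks_overfitted(r2, HERITABILITY, TOLERANCE)
    for pin, r2 in R2_BY_PIN.items()
}
for pin, flagged in flags.items():
    print(pin, R2_BY_PIN[pin], "possible over-fitting" if flagged else "acceptable")
```

With this margin, only the model selected at PIN = 0.001 passes, which agrees with the conclusion drawn in the text; a different margin could move the boundary.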

When PIN = 0.1, the R² of the obtained regression model is close to 0.9, much 
higher than the heritability of kernel weight. Using this model to adjust the phenotypic 
values before interval mapping, profiles of LOD score and additive effect are 
acquired and shown in figure 5.11. It can be seen that in addition to the seven QTLs 
detected at PIN = 0.001, there are some new QTLs. Two pairs of QTLs are closely 
linked in the repulsive phase on chromosomes 1H and 3H, respectively. According to 
the simulation studies conducted in §5.5, it is rather difficult to separate the linked 


TAB. 5.14 - Coefficients of determination of the 
regression models obtained in the barley DH 
population from four levels of PIN. 

PIN    0.001     0.01      0.05      0.1 

R²     0.7289    0.7963    0.8131    0.8886 
QTLs, especially in the repulsive phase. In fact, if recombination does not occur 
between two linked QTLs in a population, it is impossible to separate them by QTL 
mapping. The two QTLs may not be located at the same physical position on the 
chromosome, but genetically they behave as one co-segregating QTL. The separation 
of linked QTLs requires a sufficient number of recombination events 
in the mapping population. In a population of limited size, the closer 
the linked loci are to each other, the lower the chance that recombination 
occurs between them. Even if some recombination events do occur, it is still difficult to 
identify the phenotypic difference between the recombinant and parental genotypes. 

As shown in figure 5.11, for the two QTLs linked in the repulsive phase on 
chromosome 3H, one is located at 141 cM with an additive effect of 0.45, and the 
other is located at 147 cM with an additive effect of -0.33. The recombination 
frequency between the two QTLs is approximately equal to 0.06. Among the 145 DH 
lines in the population, only about 8 or 9 lines are expected to be of the recombinant type, 
and the remaining lines belong to the parental types. In such a population, it is 
almost impossible to detect such a close linkage. On the other hand, if closely linked 
QTLs are reported in a population with a size of 100-200, the detected QTLs are 
highly likely to be false positives caused by the overfitting problem. 
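
The count of recombinant lines quoted above follows from a binomial calculation, sketched below. The population size and recombination frequency come from the text; treating each DH line as an independent draw with recombinant probability r is the usual assumption, and the helper name `binom_pmf` is hypothetical.

```python
from math import comb, sqrt

n_lines = 145      # DH lines in the barley population
r = 0.06           # recombination frequency between the two linked QTLs

# Expected number of recombinant lines and its binomial standard deviation.
expected = n_lines * r
std_dev = sqrt(n_lines * r * (1.0 - r))

def binom_pmf(k, n, p):
    """Probability of exactly k recombinant lines among n."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

# Chance of observing five or fewer recombinant lines in such a population.
p_at_most_5 = sum(binom_pmf(k, n_lines, r) for k in range(6))

print(round(expected, 1), round(std_dev, 2), round(p_at_most_5, 3))
```

The expectation is 8.7 lines, which is the "about 8 or 9" in the text, and the spread around it shows how few recombinants are available for separating the two QTLs.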


FIG. 5.11 - False positive QTLs caused by an over-fitted regression model in the first step of 
ICIM in the barley population of 145 DH lines. Notes: In stepwise regression, the probability 
of a variable entering the model (i.e., PIN) was set at 0.1, and the probability of a 
variable leaving the model (i.e., POUT) was set at 0.2; the coefficient of determination 
is equal to 0.8886 in the acquired model, much higher than the heritability of kernel weight. 


In summary, two aspects can be considered when determining whether there is 
an overfitting problem in practical mapping populations. The first is to compare 
the R² of the regression model with the broad-sense heritability of the trait. If the R² 
of the regression model is much higher than the heritability, overfitting 
has occurred. Of course, the estimation of variance components and broad-sense 
heritability requires replicated phenotypic observations (see §1.4 and §1.5, chapter 1 for 
details). The second is to check the mapping results to see if there are closely 
linked QTLs. The presence of close linkage between the detected QTLs indicates that 
overfitting has occurred. When overfitting occurs, the problem can be gradually 
alleviated, and the number of false positive QTLs can be reduced, by progressively 
lowering the probability for marker variables to enter the regression model. In 
one mapping population, different traits may have different heritabilities, and the 
QTLs controlling different traits are not the same. Therefore, the suitable levels of 
PIN to select the ideal regression models for different traits may be different as well. 


Exercises 


5.1 Assume that two independent QTLs (i.e., located on two different chromosomes) control 
the phenotypic difference in plant height (cm) between two homozygous parents. 
The two alleles at one QTL are represented by Q₁ and q₁, with additive effect a₁ = 10 and 
dominant effect d₁ = 8. The two alleles at the other QTL are represented by Q₂ and q₂, with 
additive effect a₂ = 6 and dominant effect d₂ = 5. Genotypes of the two homozygous 
parents are Q₁Q₁Q₂Q₂ and q₁q₁q₂q₂. The average plant height of the four homozygous 
genotypes is m = 100, and the variance of random error is σ² = 20 cm². Effects of 
the two QTLs are additive across loci; that is, there is no epistatic interaction. 


(1) Calculate the degree of dominance for each QTL. 
(2) Calculate the expected plant height for each of the nine QTL genotypes. 


5.2 Under the same conditions as in exercise 5.1, assume that one DH population is 
derived from parents P₁ and P₂ with genotypes Q₁Q₁Q₂Q₂ and q₁q₁q₂q₂, respectively. 


(1) Calculate the mean, genetic variance, phenotypic variance, and broad-sense 
heritability of plant height in the DH population. 

(2) Calculate the genetic variance of each QTL and the proportion of phenotypic variance 
explained by each QTL. 

(3) Assuming that genotypes at Q₁ can be distinguished, but genotypes at Q₂ 
cannot, calculate the mean and variance in each of the two sub-populations 
consisting of genotypes Q₁Q₁ and q₁q₁ at Q₁. 

(4) Plot the theoretical distribution of plant height in the DH population. 


5.3 Under the same conditions as in exercise 5.1, assume that one F₂ population is 
derived from parents P₁ and P₂ with genotypes Q₁Q₁Q₂Q₂ and q₁q₁q₂q₂, respectively. 


(1) Calculate the mean, genetic variance, phenotypic variance, and broad-sense 
heritability of plant height in the F₂ population. 

(2) Calculate the genetic variance of each QTL and the proportion of phenotypic variance 
explained by each QTL. 

(3) Assuming that genotypes at Q₁ can be distinguished, but genotypes at Q₂ 
cannot, calculate the mean and variance of each of the three sub-populations 
consisting of genotypes Q₁Q₁, Q₁q₁, and q₁q₁ at Q₁. 

(4) Plot the theoretical distribution of plant height in the F₂ population. 


5.4 How does the control of background genetic variation improve the statistical 
power in QTL detection? 


5.5 Use one DH or RIL mapping population available in the QTL IciMapping 
software to conduct QTL mapping using IM and ICIM. 


(1) Plot the genome-wide profiles of LOD score and additive effect, obtained from 
the two mapping methods. 

(2) Tabulate the relevant information of the detected QTLs from the two mapping 
methods, including position on chromosome, most closely linked markers on 
both sides, and additive effect. 

(3) Compare the similarities and differences between the two mapping methods. 


5.6 Use one F₂ mapping population available in the QTL IciMapping software to 
conduct QTL mapping using IM and ICIM. 


(1) Plot the genome-wide profiles of LOD score, additive effect, and dominant 
effect, obtained from the two mapping methods. 

(2) Tabulate the relevant information of the detected QTLs from the two mapping 
methods, including position on chromosome, most closely linked markers on 
both sides, additive effect, and dominant effect. 

(3) Compare the similarities and differences between the two mapping methods. 


5.7 Assume that there are two alleles at one locus, i.e., A and a, determining the 
flower color of plants. Plants with homozygous genotype AA have red flowers and 
those with aa have white flowers. In the absence of segregation distortion, the 
proportion of red flowers in the DH population derived from one cross between 
genotypes AA and aa is p = 0.5. Assume that ten DH lines are randomly selected, 
and the number of red-flowered DH lines is represented by the random variable 
X. Based on the observed value of X, a significance test on segregation distortion 
can be performed, where non-distortion is the null hypothesis, that is, H₀: p = 0.5. If 
X ≤ 2 or X ≥ 8, the null hypothesis is rejected; otherwise, the null hypothesis is 
accepted. 


(1) Calculate the probability of type I error of the hypothesis test. 

(2) Assuming H_A: p = 0.6 is true, calculate the probability of type II error and 
statistical power of the alternative hypothesis. 

(3) Assume that the white flower gene is tightly linked to a lethal gene so that the 
theoretical frequency of red flowers in the DH population is equal to 0.75. Use 
the same rejection region to calculate the probability of type II error and 
detection power of the alternative hypothesis H_A: p = 0.75. 


5.8 Under the same conditions as in exercise 5.7, assume that X ≤ 1 or X ≥ 9 is the 
rejection region, and 2 ≤ X ≤ 8 is the acceptance region. 


(1) Calculate the probability of type I error in the hypothesis test. 

(2) Calculate the probability of type II error and detection power when H_A: p = 0.6 
is true. 

(3) Calculate the probability of type II error and detection power when H_A: 
p = 0.75 is true. 

(4) What methods do you think can be used to improve the statistical power in 
hypothesis testing? 


5.9 The probability density of the standard normal distribution N(0, 1) is 

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

Let $Z_\beta$ be the critical value of the standard normal distribution when 
the right-tail probability is equal to β; that is, $\int_{Z_\beta}^{+\infty} f(x)\,dx = \beta$. Due to the 
symmetry of the normal distribution, we have $\int_{-\infty}^{-Z_\beta} f(x)\,dx = \beta$, and therefore, 

$$\int_{-Z_\beta}^{+\infty} f(x)\,dx = 1 - \int_{-\infty}^{-Z_\beta} f(x)\,dx = 1 - \beta.$$

Let $Z_{1-\beta}$ be the critical value of the standard normal distribution when the right-tail 
probability is equal to 1 - β; we then have $Z_{1-\beta} = -Z_\beta$. 


(1) Calculate the right-tail probability β for each of $Z_\beta$ = 0.1, 0.5, 1, 1.5, 2, 2.5, and 
3.0, using the statistical table of the standard normal distribution, or the formula 
related to the normal distribution in Excel. 

(2) Calculate $Z_\beta$ for each of the right-tail probabilities β = 0.001, 0.01, 0.05, 0.1, and 0.5, 
using the statistical table of the standard normal distribution, or the formula 
related to the normal distribution in Excel. 


5.10 One crop cultivar is well-adapted and widely grown with an average grain yield 
of 400 kg ha⁻¹. There is a newly released cultivar that is expected to be 5% higher in 
yield. Under the significance level of 0.05, we wish to detect the superiority of the new 
cultivar with a power of 0.90. Based on field experiments from previous years, the 
standard deviation of random error in one observed value is about 15 kg. 


(1) Using equation 5.40, calculate in Excel the minimum number of replications 
required in field experiment. 

(2) If standard deviation of error effect is reduced by half by improving the field 
management or using more efficient experimental design, what is the minimum 
number of replications required? 


5.11 Use the QTL IciMapping software and the genetic models given in table 5.8 to 
compare the QTL detection power and false discovery rate from IM and ICIM. 
Dominant effects are not considered. A total of 500 F₂ populations are simulated, 
each with a size of 200. The length of the support interval is set at 10 cM. 


5.12 In exercise 5.11, assume that the number of markers is increased, i.e., 25 
markers are evenly distributed on each chromosome, so that marker density is equal to 
5 cM. Other conditions remain unchanged. Compare the QTL detection power and 
false discovery rate from IM and ICIM. For each mapping method, compare the 
difference in detection power and false discovery rate between the two marker densities. 


5.13 Assume that two marker loci A and B are located at 15 cM and 30 cM, 
respectively, on one chromosome, and one QTL is located at 20 cM on the same 
chromosome. Genotypes of two homozygous parents are represented by AAQQBB 
and aaqqbb. Phenotypic means of two QTL genotypes QQ and qq are known to be 
equal to 20 and 15, respectively (refer to exercise 4.4, chapter 4). 


(1) Calculate phenotypic means of the four marker genotypes in the bi-parental DH 
population. 

(2) Treating marker loci A and B as two QTLs, calculate the additive effects at loci 
A and B, and the additive by additive epistatic interaction between loci A and B. 


5.14 Assume two marker loci A and B are located at 15 cM and 30 cM, respectively, 
on one chromosome, and one QTL is located at 20 cM on the same chromosome. 
Genotypes of two homozygous parents are represented by AAQQBB and aaqqbb. 
The phenotypic mean is equal to 20 for genotype QQ, 18 for Qq, and 15 for qq (refer 
to exercise 4.5, chapter 4). 


(1) Calculate phenotypic means of the nine marker genotypes in the bi-parental F₂ 
population. 

(2) Treating marker loci A and B as two QTLs, calculate additive and dominant 
effects at loci A and B, and four epistatic interactions between loci A and B. 


Chapter 6 


QTL Mapping for Epistasis 
and Genotype-by-Environment 
Interaction 


Allelic combinations at one genetic locus produce genotypes with different pheno- 
typic performances. The additive and dominant genetic effects thus defined are 
sometimes referred to as intragenic interactions between alleles at one specific locus. 
Genes at different loci sometimes interact with each other as well, and this inter- 
action is genetically referred to as epistasis, or sometimes intergenic interactions 
(Carlborg et al., 2006; Carlborg and Haley, 2004; Doerge, 2002; Lynch and Walsh, 
1998). Epistatic interactions are also important determinants of phenotypic traits 
and genetic evolution, which could maintain the additive variance and therefore 
assure the long-term genetic gain in breeding. However, the pattern of epistatic 
interactions for quantitative traits is complex, and the genetic models contain a 
large number of main and interactive effects. Detecting epistatic QTLs and 
estimating their genetic effects remain difficult at present (Li et al., 2008; 
Carlborg and Haley, 2004). Using DH and F₂ populations as examples, this chapter 
introduces the inclusive linear models by considering the di-genic epistatic effects, 
and the application of the ICIM principles in two-dimensional genome-wide scan- 
ning for the detection of epistasis between two loci. Interactions involving more loci 
may rely on the creation of special genetic materials and mapping populations, such 
as chromosomal segment substitution lines with various segment numbers and 
near-isogenic lines with various gene combinations (see §9.2 in chapter 9 for more 
details). The last section of this chapter briefly describes the mapping method for 
QTL by environment interactions. 


6.1 Epistatic QTL Mapping in DH Populations 


6.1.1 Linear Regression in Epistatic QTL Mapping 
and the Statistical Properties 


For the convenience of the theoretical derivation, we assume that two inbred parents 
$P_1$ and $P_2$ differ in m QTLs, which are distributed in m intervals flanked by m + 1 
markers on one chromosome. Intervals where no QTLs are located are viewed as 
having a QTL with an effect of zero. Multiple QTLs located in one marker interval 
are not considered. The parental QTL genotype is assumed to be $Q_1Q_1Q_2Q_2 \cdots Q_mQ_m$ 
for $P_1$, and $q_1q_1q_2q_2 \cdots q_mq_m$ for $P_2$. For each DH line in the mapping population, 
$X = (x_1, x_2, \ldots, x_m, x_{m+1})$ represents the known marker variables, which are 
equal to 1 and -1, standing for the two parental marker types, respectively, and 
$W = (w_1, w_2, \ldots, w_m)$ represents the unknown QTL variables, which are equal to 1 
and -1, standing for the two parental QTL genotypes, respectively. Additive effects 
of the m QTLs are represented by $a_1, a_2, \ldots, a_m$, respectively, and the epistatic 
effect between QTLs j and k is denoted by $aa_{jk}$ (j, k = 1, ..., m; j < k). Under the 
assumption of the additivity of QTL effects on phenotype, the genotypic value G of 
a DH line under the additive and epistasis model can be written as equation 6.1. 


$$G = \mu + \sum_{j=1}^{m} a_j w_j + \sum_{j<k} aa_{jk}\, w_j w_k \qquad (6.1)$$


Given the m + 1 marker types of each DH line, it can be seen from equations 5.3 
and 5.4 in chapter 5 that the expectation of the QTL genotypic indicator $w_j$ 
depends on the position of the QTL located between the jth and (j + 1)th 
markers and the length of the interval, that is, 

$$E(w_j \mid X) = \lambda_{j(L)}\, x_j + \lambda_{j(R)}\, x_{j+1},$$

$$\lambda_{j(L)} = \frac{r - r_L + r_R - 2 r\, r_R}{2r(1 - r)}, \qquad \lambda_{j(R)} = \frac{r - r_R + r_L - 2 r\, r_L}{2r(1 - r)} \qquad (6.2)$$

where $r$ is the recombination frequency between the jth and (j + 1)th markers, $r_L$ is 
the recombination frequency between the jth marker and the jth QTL, and $r_R$ is the 
recombination frequency between the jth QTL and the (j + 1)th marker. Therefore, the 
expectation of $w_j w_k$ can be written as equation 6.3. 


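$$
\begin{aligned}
E(w_j w_k \mid X) &= E(w_j \mid X)\, E(w_k \mid X) \\
&= \lambda_{j(L)}\lambda_{k(L)}\, x_j x_k + \lambda_{j(L)}\lambda_{k(R)}\, x_j x_{k+1} \\
&\quad + \lambda_{j(R)}\lambda_{k(L)}\, x_{j+1} x_k + \lambda_{j(R)}\lambda_{k(R)}\, x_{j+1} x_{k+1}
\end{aligned}
\qquad (6.3)
$$
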
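
The linear form of the conditional expectation of a QTL indicator can be checked numerically. The sketch below uses one algebraically valid closed form for the two coefficients, with denominator 2r(1 - r), derived from the three recombination frequencies; the notation printed in the book may arrange the numerators differently. The function names are hypothetical, and the direct enumeration assumes no crossover interference.

```python
def lambdas(r_left, r_right, r):
    """Coefficients of the left and right flanking markers in E(w | X)."""
    denom = 2.0 * r * (1.0 - r)
    lam_left = (r - r_left + r_right - 2.0 * r * r_right) / denom
    lam_right = (r - r_right + r_left - 2.0 * r * r_left) / denom
    return lam_left, lam_right

def expected_w_direct(x_left, x_right, r_left, r_right):
    """E(w | x_j, x_j+1) by enumerating the two QTL genotypes of a DH line."""
    def transition(a, b, rec):
        # probability that two adjacent loci carry alleles a and b
        return 1.0 - rec if a == b else rec
    weight = {
        w: transition(x_left, w, r_left) * transition(w, x_right, r_right)
        for w in (1, -1)
    }
    return (weight[1] - weight[-1]) / (weight[1] + weight[-1])

# Hypothetical QTL position inside one marker interval.
r_left, r_right = 0.05, 0.10
r = r_left + r_right - 2.0 * r_left * r_right   # no-interference assumption
lam_left, lam_right = lambdas(r_left, r_right, r)

max_gap = max(
    abs(lam_left * xl + lam_right * xr - expected_w_direct(xl, xr, r_left, r_right))
    for xl in (1, -1)
    for xr in (1, -1)
)
print(lam_left, lam_right, max_gap)
```

The two coefficients reproduce the enumerated expectation for all four marker types, and when the QTL sits on the left marker they reduce to 1 and 0.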
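Given the marker types, the expectation of the genotypic value G as defined in 
equation 6.1 can be expressed by a linear model including the main and interactive 
effects of markers, i.e., equation 6.4. 

$$
\begin{aligned}
E(G \mid X) &= \mu + \sum_{j=1}^{m} a_j \left( \lambda_{j(L)}\, x_j + \lambda_{j(R)}\, x_{j+1} \right) \\
&\quad + \sum_{j<k} aa_{jk} \left( \lambda_{j(L)}\lambda_{k(L)}\, x_j x_k + \lambda_{j(L)}\lambda_{k(R)}\, x_j x_{k+1}
 + \lambda_{j(R)}\lambda_{k(L)}\, x_{j+1} x_k + \lambda_{j(R)}\lambda_{k(R)}\, x_{j+1} x_{k+1} \right) \\
&= \beta_0 + \sum_{j=1}^{m+1} \beta_j x_j + \sum_{j<k} \beta_{jk}\, x_j x_k
\end{aligned}
\qquad (6.4)
$$

where 

$$
\begin{aligned}
\beta_0 &= \mu, \\
\beta_1 &= \lambda_{1(L)}\, a_1, \\
\beta_j &= \lambda_{j-1(R)}\, a_{j-1} + \lambda_{j(L)}\, a_j \quad (j = 2, \ldots, m), \\
\beta_{m+1} &= \lambda_{m(R)}\, a_m, \\
\beta_{12} &= \lambda_{1(L)}\lambda_{2(L)}\, aa_{12}, \\
\beta_{1k} &= \lambda_{1(L)}\lambda_{k-1(R)}\, aa_{1,k-1} + \lambda_{1(L)}\lambda_{k(L)}\, aa_{1k} \quad (k = 3, \ldots, m), \\
\beta_{1,m+1} &= \lambda_{1(L)}\lambda_{m(R)}\, aa_{1m}, \\
\beta_{j,j+1} &= \lambda_{j-1(R)}\lambda_{j(R)}\, aa_{j-1,j} + \lambda_{j-1(R)}\lambda_{j+1(L)}\, aa_{j-1,j+1}
 + \lambda_{j(L)}\lambda_{j+1(L)}\, aa_{j,j+1} \quad (j = 2, \ldots, m-1), \\
\beta_{jk} &= \lambda_{j-1(R)}\lambda_{k-1(R)}\, aa_{j-1,k-1} + \lambda_{j-1(R)}\lambda_{k(L)}\, aa_{j-1,k}
 + \lambda_{j(L)}\lambda_{k-1(R)}\, aa_{j,k-1} \\
&\quad + \lambda_{j(L)}\lambda_{k(L)}\, aa_{jk} \quad (j \neq 1,\ k \neq m+1,\ j < k-1), \\
\beta_{j,m+1} &= \lambda_{j-1(R)}\lambda_{m(R)}\, aa_{j-1,m} + \lambda_{j(L)}\lambda_{m(R)}\, aa_{jm} \quad (j = 2, \ldots, m-1), \\
\beta_{m,m+1} &= \lambda_{m-1(R)}\lambda_{m(R)}\, aa_{m-1,m}.
\end{aligned}
$$
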

Thus, the epistatic effect between the jth and kth QTLs only contributes to the 
marker interactive coefficients $\beta_{jk}$, $\beta_{j+1,k}$, $\beta_{j,k+1}$, and $\beta_{j+1,k+1}$. If there is at least one 
empty interval between the two current intervals (j, j + 1) and (k, k + 1), and no 
QTLs are located in their four neighboring intervals, i.e., (j - 1, j), (j + 1, j + 2), 
(k - 1, k), and (k + 1, k + 2), then $aa_{j-1,k-1}$, $aa_{j-1,k}$, $aa_{j-1,k+1}$, $aa_{j,k-1}$, $aa_{j,k+1}$, $aa_{j+1,k-1}$, 
$aa_{j+1,k}$, and $aa_{j+1,k+1}$ are all equal to zero. In this case, $\beta_{jk} = \lambda_{j(L)}\lambda_{k(L)}\, aa_{jk}$, 
$\beta_{j+1,k} = \lambda_{j(R)}\lambda_{k(L)}\, aa_{jk}$, $\beta_{j,k+1} = \lambda_{j(L)}\lambda_{k(R)}\, aa_{jk}$, and $\beta_{j+1,k+1} = \lambda_{j(R)}\lambda_{k(R)}\, aa_{jk}$. These 
four coefficients contain all the position and effect information of the epistasis between 
the jth and kth QTL, providing the theoretical basis for mapping epistasis by the 
ICIM method. 
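
The way each QTL pair spreads its epistatic effect over four marker-pair coefficients can be sketched as a short accumulation loop. The function name `interaction_coefficients`, the 0-based indices, and all numerical values are hypothetical; the sketch also assumes the two intervals of a pair are not adjacent (k > j + 1), so that no marker is shared between them.

```python
def interaction_coefficients(lam_left, lam_right, aa):
    """Accumulate marker-pair coefficients from QTL-pair epistatic effects.

    Interval j (0-based) is flanked by markers j and j + 1.
    lam_left[j], lam_right[j] : coefficients of the two flanking markers
    aa : dict mapping an interval pair (j, k), with k > j + 1, to its effect
    """
    beta = {}
    for (j, k), effect in aa.items():
        # each QTL pair contributes to the four products of flanking markers
        for p, lam_p in ((j, lam_left[j]), (j + 1, lam_right[j])):
            for q, lam_q in ((k, lam_left[k]), (k + 1, lam_right[k])):
                beta[(p, q)] = beta.get((p, q), 0.0) + lam_p * lam_q * effect
    return beta

# Hypothetical chromosome with five intervals and one isolated epistatic pair.
lam_left = [0.6, 0.5, 0.4, 0.7, 0.5]
lam_right = [0.3, 0.4, 0.5, 0.2, 0.4]
beta = interaction_coefficients(lam_left, lam_right, {(0, 3): 2.0})
for pair in sorted(beta):
    print(pair, round(beta[pair], 2))
```

With a single isolated pair, exactly four coefficients are non-zero, and each is the product of two marker coefficients and the epistatic effect, as stated above.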

Suppose that we have a number of n DH lines in the mapping population with 
observations on a quantitative trait of interest and m + 1 ordered markers. Based 
on equation 6.4, the following linear regression model can be constructed between 
phenotype and marker variables and then used to control the background genetic 
variation in QTL mapping. 

$$y_i = \beta_0 + \sum_{j=1}^{m+1} \beta_j x_{ij} + \sum_{j<k} \beta_{jk}\, x_{ij} x_{ik} + \varepsilon_i \qquad (6.5)$$


where $y_i$ is the phenotypic value of the ith DH line in the mapping population; $\beta_0$ is 
the overall mean of the linear model; $x_{ij}$ is an indicator variable for the genotype of 
the ith DH line at the jth marker, taking value 1 for the $P_1$ parental type and -1 for 
the $P_2$ parental type; $\beta_j$ is the partial regression coefficient of the phenotype on the 
jth marker variable; $\beta_{jk}$ is the partial regression coefficient of the phenotype on the 
product of the jth and kth marker variables; and $\varepsilon_i$ is the residual random 
error, which is assumed to be normally distributed with mean 0 and variance $\sigma_\varepsilon^2$. 
Random errors are also assumed to be independent. 

A two-stage strategy of stepwise regression was adopted to determine the 
parameters of coefficients in equation 6.5. Significant marker variables were selected 
in the first stage, similar to the additive QTL mapping as has been introduced in 
chapter 5. Then stepwise regression was applied on the residuals from the first stage to 
select the significant marker pairs and estimate their effects in equation 6.5. Stricter 
probability levels should be adopted in the first stage to avoid over-fitting since the 
number of pairwise marker variables is much larger than the number of markers. 
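
To see why the pairwise variables dominate the model in equation 6.5, the explanatory variables of one DH line can be written out explicitly. The sketch below only builds the variables and counts them; it does not perform the stepwise selection itself, and the function names `design_row` and `n_terms` are hypothetical.

```python
from itertools import combinations

def design_row(x):
    """Explanatory variables of one DH line: intercept, markers, marker products."""
    products = [x[j] * x[k] for j, k in combinations(range(len(x)), 2)]
    return [1] + list(x) + products

def n_terms(n_markers):
    """Number of coefficients in the full model with all pairwise products."""
    return 1 + n_markers + n_markers * (n_markers - 1) // 2

row = design_row([1, -1, 1])   # hypothetical line genotyped at three markers
print(row)
print(n_terms(100))            # intercept + 100 markers + 4950 marker pairs
```

With 100 markers there are already 4950 candidate marker pairs, far more than the number of lines in a typical mapping population, which is why variable selection is unavoidable here.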


6.1.2 Two-Dimensional Scanning on Di-Genic Epistatic 
QTLs 


When conducting the two-dimensional scanning for epistatic QTLs, there are two 
current testing intervals represented by (j, j + 1) and (k, k + 1), where j < k. The 
phenotypic values in equation 6.5 were adjusted by

$$\Delta y_i = y_i - \sum_{r \neq j,\, j+1,\, k,\, k+1} \hat{\beta}_r x_{ir} - \sum_{\substack{r \neq j,\, j+1 \\ s \neq k,\, k+1}} \hat{\beta}_{rs} x_{ir} x_{is} \tag{6.6}$$

where $\hat{\beta}_r$ and $\hat{\beta}_{rs}$ are the estimates of coefficients $\beta_r$ and $\beta_{rs}$ defined in equation 6.5,
respectively. The adjusted phenotype $\Delta y_i$ thus obtained contains the information of
QTL in the two testing intervals, which includes two positions and two additive
effects of the two interacting QTLs, and one epistatic effect between the two QTLs.
In the meantime, the additive and epistatic effects of QTLs located on other
intervals and chromosomes are completely controlled. The adjusted observation $\Delta y_i$
does not change until either of the two testing positions moves into the next interval.
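
The adjustment in equation 6.6 can be sketched in a few lines of Python. This is a minimal
illustration, not the authors' implementation: the function name, the dictionary layout of
the coefficient estimates and the toy data are all hypothetical. The sketch also assumes
one particular reading of the index condition under the second sum, namely that only the
products between a flanking marker of the first interval and a flanking marker of the
second interval are left in the adjusted phenotype.

```python
def adjust_phenotype(y, x, beta, beta_pair, j, k):
    """Adjust phenotypes for background markers, following equation 6.6.

    y         -- list of n phenotypic values
    x         -- n rows of marker codes (1 or -1), markers indexed from 0
    beta      -- {r: estimate} for marker variables kept by stepwise regression
    beta_pair -- {(r, s): estimate} with r < s, for the selected marker pairs
    j, k      -- index of the left flanking marker of each testing interval (j < k)
    """
    first = {j, j + 1}
    second = {k, k + 1}
    flanking = first | second
    adjusted = []
    for y_i, x_i in zip(y, x):
        value = y_i
        for r, b in beta.items():
            if r not in flanking:           # flanking markers stay in the adjusted value
                value -= b * x_i[r]
        for (r, s), b in beta_pair.items():
            if r in first and s in second:  # assumed reading: these products are kept
                continue
            value -= b * x_i[r] * x_i[s]
        adjusted.append(value)
    return adjusted


# Hypothetical toy data: two DH lines, five markers, testing intervals (0, 1) and (3, 4).
y = [10.0, 8.0]
x = [[1, 1, -1, 1, 1], [-1, 1, 1, -1, 1]]
beta = {0: 0.5, 2: 1.0}
beta_pair = {(0, 3): 0.2, (2, 4): 0.3}
delta_y = adjust_phenotype(y, x, beta, beta_pair, j=0, k=3)
```

Because the estimates enter only through the two dictionaries, the adjusted values need to
be recomputed only when `j` or `k` changes, which matches the remark that the adjusted
observation stays the same within a pair of intervals.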

DH lines in the mapping population can be classified into sixteen groups based
on their marker types (table 6.1). If there are two QTLs (with the two alleles
denoted as $Q_j$ and $q_j$, and $Q_k$ and $q_k$) at the two testing positions, $\Delta y_i$ follows a
mixture distribution consisting of four QTL genotypes, i.e., $Q_jQ_jQ_kQ_k$, $Q_jQ_jq_kq_k$,
$q_jq_jQ_kQ_k$, and $q_jq_jq_kq_k$. The four QTL genotypic means are represented by $\mu_1$, $\mu_2$, $\mu_3$,
and $\mu_4$. Proportions of the four QTL genotypes in each marker group can be
acquired from recombination frequencies (table 6.1). QTL at the current two
mapping positions can be tested by the following hypotheses:

$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4,$$
$$H_A: \text{at least two of } \mu_1, \mu_2, \mu_3 \text{ and } \mu_4 \text{ are not equal}$$

Then the log-likelihood function under the alternative hypothesis $H_A$ is,

$$\ln L_A = \sum_{j=1}^{16} \sum_{i \in S_j} \ln \left[ \sum_{l=1}^{4} \pi_{jl}\, f(\Delta y_i; \mu_l, \sigma^2) \right] \tag{6.7}$$

where $S_j$ denotes the jth marker type group (j = 1, ..., 16); $\pi_{jl}$ (l = 1, ..., 4) is the
proportion of the lth QTL genotype in the jth group (table 6.1); $f(\cdot; \mu_l, \sigma^2)$
represents the density function of the lth normal distribution $N(\mu_l, \sigma^2)$; and
l = 1, ..., 4, representing the four QTL genotypes, i.e., $Q_jQ_jQ_kQ_k$, $Q_jQ_jq_kq_k$, $q_jq_jQ_kQ_k$,
and $q_jq_jq_kq_k$.

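From the four QTL genotypic frequencies in table 6.1, it can be seen that most
DH lines have genotype $Q_jQ_jQ_kQ_k$ in group 1, genotype $Q_jQ_jq_kq_k$ in group 4, genotype
$q_jq_jQ_kQ_k$ in group 13, and genotype $q_jq_jq_kq_k$ in group 16. Therefore, the initial
values of the five unknown parameters in the EM algorithm can be acquired from the
four marker-type groups, i.e.,

$$\mu_1^{(0)} = \frac{1}{n_1} \sum_{i=1}^{n_1} \Delta y_i, \qquad \mu_2^{(0)} = \frac{1}{n_4} \sum_{i=n_{1:3}+1}^{n_{1:4}} \Delta y_i,$$

$$\mu_3^{(0)} = \frac{1}{n_{13}} \sum_{i=n_{1:12}+1}^{n_{1:13}} \Delta y_i, \qquad \mu_4^{(0)} = \frac{1}{n_{16}} \sum_{i=n_{1:15}+1}^{n_{1:16}} \Delta y_i,$$

$$\sigma^{2(0)} = \frac{1}{n_1 + n_4 + n_{13} + n_{16}} \left[ \sum_{i=1}^{n_1} \left(\Delta y_i - \mu_1^{(0)}\right)^2 + \sum_{i=n_{1:3}+1}^{n_{1:4}} \left(\Delta y_i - \mu_2^{(0)}\right)^2 + \sum_{i=n_{1:12}+1}^{n_{1:13}} \left(\Delta y_i - \mu_3^{(0)}\right)^2 + \sum_{i=n_{1:15}+1}^{n_{1:16}} \left(\Delta y_i - \mu_4^{(0)}\right)^2 \right]$$

where $n_1, \ldots, n_{16}$ represent the numbers of DH lines in the 16 marker type groups; $n_{1:3}$
represents the summation from $n_1$ to $n_3$, and so on. In the E-step, the posterior
probability of the ith DH line (i = 1, ..., n) belonging to the lth (l = 1, ..., 4) QTL
genotype was calculated as,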


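$$w_{il} = \frac{\pi_{jl}\, f\left(\Delta y_i; \mu_l^{(0)}, \sigma^{2(0)}\right)}{\sum_{h=1}^{4} \pi_{jh}\, f\left(\Delta y_i; \mu_h^{(0)}, \sigma^{2(0)}\right)}$$


where j denotes the marker type group into which the ith individual is classified. In
the M-step, the five parameters were updated as,

$$\mu_l^{(1)} = \frac{\sum_{i=1}^{n} w_{il}\, \Delta y_i}{\sum_{i=1}^{n} w_{il}} \quad (l = 1, \ldots, 4), \qquad \sigma^{2(1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{4} w_{il} \left(\Delta y_i - \mu_l^{(1)}\right)^2$$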


TAB. 6.1 — Sixteen marker-type groups and their frequencies for the two current mapping intervals and the frequencies of QTL genotypes
within each group.

Marker type group, two pairs of flanking markers, group frequency, and QTL genotype at the two testing positions (genotypic value in the second header row)

Group  M_j  M_{j+1}  M_k  M_{k+1}  Frequency                                       Q_jQ_jQ_kQ_k     Q_jQ_jq_kq_k     q_jq_jQ_kQ_k     q_jq_jq_kq_k
                                                                                   μ+a_j+a_k+aa_jk  μ+a_j−a_k−aa_jk  μ−a_j+a_k−aa_jk  μ−a_j−a_k+aa_jk
1      1    1        1    1        (1/2)(1−r_{j,j+1})(1−r_{j+1,k})(1−r_{k,k+1})    p1 p3            p1(1−p3)         (1−p1)p3         (1−p1)(1−p3)
2      1    1        1    −1       (1/2)(1−r_{j,j+1})(1−r_{j+1,k}) r_{k,k+1}       p1 p4            p1(1−p4)         (1−p1)p4         (1−p1)(1−p4)
3      1    1        −1   1        (1/2)(1−r_{j,j+1}) r_{j+1,k} r_{k,k+1}          p1(1−p4)         p1 p4            (1−p1)(1−p4)     (1−p1)p4
4      1    1        −1   −1       (1/2)(1−r_{j,j+1}) r_{j+1,k} (1−r_{k,k+1})      p1(1−p3)         p1 p3            (1−p1)(1−p3)     (1−p1)p3
5      1    −1       1    1        (1/2) r_{j,j+1} r_{j+1,k} (1−r_{k,k+1})         p2 p3            p2(1−p3)         (1−p2)p3         (1−p2)(1−p3)
6      1    −1       1    −1       (1/2) r_{j,j+1} r_{j+1,k} r_{k,k+1}             p2 p4            p2(1−p4)         (1−p2)p4         (1−p2)(1−p4)
7      1    −1       −1   1        (1/2) r_{j,j+1} (1−r_{j+1,k}) r_{k,k+1}         p2(1−p4)         p2 p4            (1−p2)(1−p4)     (1−p2)p4
8      1    −1       −1   −1       (1/2) r_{j,j+1} (1−r_{j+1,k})(1−r_{k,k+1})      p2(1−p3)         p2 p3            (1−p2)(1−p3)     (1−p2)p3
9      −1   1        1    1        (1/2) r_{j,j+1} (1−r_{j+1,k})(1−r_{k,k+1})      (1−p2)p3         (1−p2)(1−p3)     p2 p3            p2(1−p3)
10     −1   1        1    −1       (1/2) r_{j,j+1} (1−r_{j+1,k}) r_{k,k+1}         (1−p2)p4         (1−p2)(1−p4)     p2 p4            p2(1−p4)
11     −1   1        −1   1        (1/2) r_{j,j+1} r_{j+1,k} r_{k,k+1}             (1−p2)(1−p4)     (1−p2)p4         p2(1−p4)         p2 p4
12     −1   1        −1   −1       (1/2) r_{j,j+1} r_{j+1,k} (1−r_{k,k+1})         (1−p2)(1−p3)     (1−p2)p3         p2(1−p3)         p2 p3
13     −1   −1       1    1        (1/2)(1−r_{j,j+1}) r_{j+1,k} (1−r_{k,k+1})      (1−p1)p3         (1−p1)(1−p3)     p1 p3            p1(1−p3)
14     −1   −1       1    −1       (1/2)(1−r_{j,j+1}) r_{j+1,k} r_{k,k+1}          (1−p1)p4         (1−p1)(1−p4)     p1 p4            p1(1−p4)
15     −1   −1       −1   1        (1/2)(1−r_{j,j+1})(1−r_{j+1,k}) r_{k,k+1}       (1−p1)(1−p4)     (1−p1)p4         p1(1−p4)         p1 p4
16     −1   −1       −1   −1       (1/2)(1−r_{j,j+1})(1−r_{j+1,k})(1−r_{k,k+1})    (1−p1)(1−p3)     (1−p1)p3         p1(1−p3)         p1 p3

Notes: "1" and "−1" denote the P1 and P2 parental genotypes, respectively.
p1 = (1 − r_{j,q_j})(1 − r_{q_j,j+1})/(1 − r_{j,j+1}), p2 = (1 − r_{j,q_j}) r_{q_j,j+1}/r_{j,j+1},
p3 = (1 − r_{k,q_k})(1 − r_{q_k,k+1})/(1 − r_{k,k+1}), p4 = (1 − r_{k,q_k}) r_{q_k,k+1}/r_{k,k+1},
where r is the recombination frequency between two markers or between one marker and one QTL
indicated by subscripts. r_{j+1,k} = 0.5 if the two markers are located on two chromosomes.
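
The EM iteration for the mixture likelihood of equation 6.7 can be sketched as follows. This
is a minimal illustration under stated assumptions, not the authors' code: the M-step uses
the standard updates for a normal mixture with a common variance (weighted means followed by
a pooled weighted variance), the function names are hypothetical, and the default tolerance
`tol=1e-6` is an arbitrary illustrative value. Each row of `pi` holds the QTL genotype
proportions of the marker group to which the corresponding line belongs (table 6.1).

```python
import math


def normal_pdf(value, mean, var):
    return math.exp(-(value - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(dy, pi, means, var):
    """Mixture log-likelihood (equation 6.7), summed line by line."""
    return sum(
        math.log(sum(p * normal_pdf(v, m, var) for p, m in zip(row, means)))
        for v, row in zip(dy, pi)
    )


def em_fit(dy, pi, means, var, tol=1e-6, max_iter=200):
    """Alternate E- and M-steps until the log-likelihood changes by less than tol."""
    means = list(means)
    n = len(dy)
    current = log_likelihood(dy, pi, means, var)
    for _ in range(max_iter):
        # E-step: posterior probability of each QTL genotype for each line
        w = []
        for v, row in zip(dy, pi):
            dens = [p * normal_pdf(v, m, var) for p, m in zip(row, means)]
            total = sum(dens)
            w.append([d / total for d in dens])
        # M-step: weighted means, then one common variance
        for l in range(len(means)):
            weight = sum(w_i[l] for w_i in w)
            if weight > 0.0:
                means[l] = sum(w_i[l] * v for w_i, v in zip(w, dy)) / weight
        var = sum(w_i[l] * (v - means[l]) ** 2
                  for w_i, v in zip(w, dy) for l in range(len(means))) / n
        updated = log_likelihood(dy, pi, means, var)
        converged = abs(updated - current) < tol
        current = updated
        if converged:
            break
    return means, var, current


# Hypothetical toy data with two genotype classes, only to show the call.
dy = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
pi = [[0.95, 0.05]] * 3 + [[0.05, 0.95]] * 3
fitted_means, fitted_var, fitted_ll = em_fit(dy, pi, means=[0.5, 3.5], var=1.0)
```

The function works for any number of genotype classes, so the same sketch applies to the
four-class DH case and, with nine classes, to the likelihood used later for F2 populations.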


The EM algorithm continues until the difference in the log-likelihood between two
succeeding iterations falls below a pre-assigned precision. The maximum
likelihood estimates thus obtained are represented by $\hat{\mu}_1$, $\hat{\mu}_2$, $\hat{\mu}_3$, $\hat{\mu}_4$ and $\hat{\sigma}^2$.
Relationships between QTL genotypic means and two additive effects ($a_j$ and $a_k$)
together with their epistatic effect ($aa_{jk}$) are given as follows.

$$\begin{aligned}
\mu_1 &= \mu + a_j + a_k + aa_{jk}, \\
\mu_2 &= \mu + a_j - a_k - aa_{jk}, \\
\mu_3 &= \mu - a_j + a_k - aa_{jk}, \\
\mu_4 &= \mu - a_j - a_k + aa_{jk}
\end{aligned} \tag{6.8}$$

Therefore, the additive and epistatic effects can be calculated, i.e.,

$$\begin{aligned}
\mu &= \tfrac{1}{4}(\mu_1 + \mu_2 + \mu_3 + \mu_4), \\
a_j &= \tfrac{1}{4}(\mu_1 + \mu_2 - \mu_3 - \mu_4), \\
a_k &= \tfrac{1}{4}(\mu_1 - \mu_2 + \mu_3 - \mu_4), \\
aa_{jk} &= \tfrac{1}{4}(\mu_1 - \mu_2 - \mu_3 + \mu_4)
\end{aligned} \tag{6.9}$$


Replacing the QTL genotypic means in equation 6.9 with their estimates $\hat{\mu}_1$, $\hat{\mu}_2$,
$\hat{\mu}_3$ and $\hat{\mu}_4$, the estimates of additive and epistatic effects can be acquired. For
example, at one pair of scanning positions, the four QTL genotypes $Q_jQ_jQ_kQ_k$,
$Q_jQ_jq_kq_k$, $q_jq_jQ_kQ_k$, and $q_jq_jq_kq_k$ have estimated genotypic values of 11.85, 7.27,
10.27 and 11.61, respectively. From equation 6.9, we have $\mu$ = 10.25,
$a_j$ = −0.69, $a_k$ = 0.81, and $aa_{jk}$ = 1.48. When ignoring the epistatic effect, i.e., when only
the two additive effects are used in predicting the genotypic values, genotype
$q_jq_jQ_kQ_k$ would be the one with the highest genotypic value. But in fact, genotype
$Q_jQ_jQ_kQ_k$ has the highest value (i.e., 11.85) among the four QTL genotypes.
Therefore, when epistasis is present, the best genotype cannot be properly
determined by the best genotypes at individual loci, and the distinction between
favorable and unfavorable alleles at individual loci may not always make sense.
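
The worked example above can be checked with a short Python sketch of equation 6.9. The
function and variable names are hypothetical; the four genotypic means are the values
quoted in the text.

```python
def effects_from_means(m1, m2, m3, m4):
    """Equation 6.9: overall mean, two additive effects and the epistatic effect."""
    mu = (m1 + m2 + m3 + m4) / 4.0
    a_j = (m1 + m2 - m3 - m4) / 4.0
    a_k = (m1 - m2 + m3 - m4) / 4.0
    aa_jk = (m1 - m2 - m3 + m4) / 4.0
    return mu, a_j, a_k, aa_jk


observed = {"QjQjQkQk": 11.85, "QjQjqkqk": 7.27, "qjqjQkQk": 10.27, "qjqjqkqk": 11.61}
mu, a_j, a_k, aa_jk = effects_from_means(*observed.values())

# Predictions that ignore epistasis (additive effects only)
additive_only = {
    "QjQjQkQk": mu + a_j + a_k,
    "QjQjqkqk": mu + a_j - a_k,
    "qjqjQkQk": mu - a_j + a_k,
    "qjqjqkqk": mu - a_j - a_k,
}
best_additive = max(additive_only, key=additive_only.get)
best_observed = max(observed, key=observed.get)
```

Comparing `best_additive` with `best_observed` reproduces the point of the example: the
genotype ranked first by the additive effects alone differs from the genotype with the
highest estimated value once the epistatic effect is included.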

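Under the null hypothesis $H_0$, all $\Delta y_i$ follow a normal distribution $N(\mu_0, \sigma_0^2)$.
The log-likelihood function is,

$$\ln L_0 = \sum_{i=1}^{n} \ln\left[f(\Delta y_i; \mu_0, \sigma_0^2)\right] \tag{6.10}$$

Mean and variance of the normal distribution can be estimated as,

$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} \Delta y_i, \qquad \hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} \left(\Delta y_i - \hat{\mu}_0\right)^2$$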


The LOD score (denoted by $LOD_A$) at the current testing positions can be
calculated from the log-likelihoods under the two hypotheses, i.e., equation 6.11,
which can be used to test whether there is a significant difference among the four
QTL genotypes. There are five parameters to be estimated under $H_A$, and two
parameters to be estimated under $H_0$. Therefore, the LRT statistic corresponding to
$LOD_A$ approaches a $\chi^2$-distribution with three degrees of freedom. The degree of
freedom here is actually equal to the number of genetic effects to be estimated, i.e.,
two additive effects plus one epistatic effect.

$$LOD_A = \max \log_{10} L_A - \max \log_{10} L_0 \tag{6.11}$$


As $\Delta y_i$ contains the information on QTL positions, additive, and epistatic effects
of QTLs in the two testing intervals, both additive and epistatic effects affect the test
statistic $LOD_A$. In order to test only the presence of epistasis, the influence of
additive effects has to be removed, and therefore another alternative hypothesis $H_{AA}$
is needed for this purpose, i.e.,

$$H_{AA}: \mu_1 - \mu_2 - \mu_3 + \mu_4 = 0, \text{ or equivalently } aa_{jk} = 0$$

The difference in maximum likelihood between $H_{AA}$ and $H_A$ represents
the net contribution from the epistatic effect, and their maximum likelihoods can be
used to test the significance of the epistatic effect at the two scanning positions.
Maximum likelihood estimates under $H_{AA}$ are calculated from the conditional maximum
of $L_A$. Let,

$$\ln L_{AA} = \ln L_A - \lambda(\mu_1 - \mu_2 - \mu_3 + \mu_4)$$

where $\lambda$ is called the Lagrange multiplier in calculus. In the EM algorithm, the
calculation of posterior probabilities is the same as that for hypothesis $H_A$. In the
M-step, the five parameters are updated subject to the condition of hypothesis $H_{AA}$.
The LOD score testing epistasis (denoted by $LOD_{AA}$) is then calculated as

$$LOD_{AA} = \max \log_{10} L_A - \max \log_{10} L_{AA} \tag{6.12}$$

The LOD score calculated by equation 6.12 indicates
whether the interaction is significant at the two testing positions. It is worth noting
that the EM algorithms described above have a fast convergence speed. The
pre-assigned precision is reached within at most 10 iterations for most testing positions.
Compared with hypothesis $H_A$, one restricted condition is added in hypothesis $H_{AA}$,
and the number of independent parameters is reduced by one. Therefore, the LRT
statistic corresponding to $LOD_{AA}$ approaches a $\chi^2$-distribution with one degree of
freedom.
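
One way to obtain the M-step means under $H_{AA}$ is to set the derivative of the expected
complete-data log-likelihood, minus the Lagrange term, to zero for each genotypic mean.
This gives $\mu_l = (S_l - c_l \lambda\sigma^2)/W_l$ with $W_l = \sum_i w_{il}$, $S_l = \sum_i w_{il} \Delta y_i$ and
$c = (1, -1, -1, 1)$, and the constraint then fixes $\lambda\sigma^2$. The Python sketch below
follows that derivation; it is an illustration under these assumptions rather than the
exact update used by the authors, and all names are hypothetical.

```python
def constrained_means(w, dy):
    """M-step means subject to mu1 - mu2 - mu3 + mu4 = 0.

    w  -- posterior probabilities, one row of four values per line
    dy -- adjusted phenotypes
    """
    c = (1.0, -1.0, -1.0, 1.0)
    weight = [sum(w_i[l] for w_i in w) for l in range(4)]                   # W_l
    total = [sum(w_i[l] * v for w_i, v in zip(w, dy)) for l in range(4)]    # S_l
    unconstrained = [total[l] / weight[l] for l in range(4)]
    # lambda * sigma^2 that makes the updated means satisfy the constraint
    lam_sigma2 = (sum(c[l] * unconstrained[l] for l in range(4))
                  / sum(1.0 / weight[l] for l in range(4)))
    return [(total[l] - c[l] * lam_sigma2) / weight[l] for l in range(4)]


# Toy check: one line per genotype, with the genotypic values of the earlier example.
w = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
dy = [11.85, 7.27, 10.27, 11.61]
means_no_epistasis = constrained_means(w, dy)
```

In this toy case the constrained means coincide with the additive-only predictions of the
four genotypes, which is what removing the epistatic effect from equation 6.8 implies.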


The difference between hypotheses $H_{AA}$ and $H_0$ reflects the significance of the two
additive effects. If wanted, the difference can be tested by the statistic
$LOD_A - LOD_{AA} = \max \log_{10} L_{AA} - \max \log_{10} L_0$, for which the corresponding LRT
statistic approaches a $\chi^2$-distribution with
two degrees of freedom. Therefore, the test statistic defined in equation 6.11 can be
decomposed into two independent parts, one part testing the significance of one
epistatic effect and the other part testing the two additive effects. Of course, additive
effects are not the major objective in epistatic QTL mapping; they can be better
tested and estimated from the one-dimensional scanning as described in chapter 5.
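
For readers who want to turn a LOD score into an approximate P-value, the relation
LRT = 2 ln(10) × LOD and the asymptotic $\chi^2$-distributions mentioned above can be combined
as in the sketch below. Only the Python standard library is used; the upper-tail
probabilities for one, two and three degrees of freedom are written in closed form. The
function names and the LOD values in the example are hypothetical, and the P-values are
only as good as the asymptotic approximation.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1, 2 or 3."""
    if x <= 0.0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("this sketch handles df = 1, 2 or 3 only")


def lod_to_p(lod, df):
    """Convert a LOD score to a P-value through the likelihood ratio statistic."""
    lrt = 2.0 * math.log(10.0) * lod
    return chi2_sf(lrt, df)


# Hypothetical LOD scores at one pair of testing positions
lod_a, lod_aa = 6.0, 4.0
p_all = lod_to_p(lod_a, df=3)               # additive plus epistatic effects
p_epistasis = lod_to_p(lod_aa, df=1)        # epistatic effect only
p_additive = lod_to_p(lod_a - lod_aa, df=2)  # the two additive effects
```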


6.1.3 Genetic Variance on Epistatic QTLs with Linkage 


From genotypic value G defined in equation 6.1, the theoretical additive variance
can be given by,

$$V_A = \mathrm{Var}\left(\sum_{j=1}^{m} a_j w_j\right) = \sum_{j,k=1}^{m} \mathrm{Cov}(w_j, w_k)\, a_j a_k = \sum_{j,k=1}^{m} (1 - 2r_{jk})\, a_j a_k \tag{6.13}$$

where $r_{jk}$ is the recombination frequency between the jth and kth QTLs. The
theoretical epistatic variance can be given by,

$$\begin{aligned}
V_I &= \mathrm{Var}\left(\sum_{j<k} aa_{jk} w_j w_k\right) = \sum_{j<k,\, l<m} \mathrm{Cov}(w_j w_k, w_l w_m)\, aa_{jk}\, aa_{lm} \\
&= \sum_{j<k,\, l<m} \left[(1 - 2r_{jl})(1 - 2r_{km}) - (1 - 2r_{jk})(1 - 2r_{lm})\right] aa_{jk}\, aa_{lm}
\end{aligned} \tag{6.14}$$

where $r_{jk}$, $r_{jl}$, $r_{km}$ and $r_{lm}$ are the recombination frequencies between the jth and kth
QTLs, between the jth and lth QTLs, between the kth and mth QTLs, and
between the lth and mth QTLs, respectively. It can be proved that in
equation 6.14 $\mathrm{Cov}(w_j w_k, w_l w_m) = 0$ if l > k or m < j. Equations 6.13 and 6.14
can be used to evaluate the relative importance of epistatic variance for any
defined genetic models containing the additive effects and di-genic epistasis or
after a QTL mapping study.

6.1.4 Simulation Study on Epistatic QTL Mapping in DH
Populations


Assume one genome consists of four chromosomes, each of 100 cM in length and
evenly covered by 11 markers (Li et al., 2008; Yi et al., 2003). There are seven
QTLs (represented by $Q_1$ to $Q_7$) with three di-genic interactions on the expression of a
quantitative trait of interest, in addition to some additive effects of individual QTLs
(table 6.2). $Q_1$ and $Q_2$ are located on the first chromosome at 25 and 45 cM with
additive effects −0.7 and 0.9, respectively. $Q_3$ and $Q_4$ are located on the second
chromosome at 25 and 55 cM with no additive effects. $Q_5$ and $Q_6$ are located on the
third chromosome at 25 and 55 cM, the first with no additive effect and the second with
additive effect −0.9. $Q_7$ is located on the fourth chromosome at 15 cM with additive
effect −0.9. The additive by additive epistatic effect is equal to 1.7 between $Q_1$ and
$Q_2$ and between $Q_3$ and $Q_4$, and equal to −1.7 between $Q_5$ and $Q_6$. The residual variance
$\sigma_\varepsilon^2$ is set to 1, and one hundred DH populations, each of 300 lines, are simulated.
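
As an illustration of equations 6.13 and 6.14, the sketch below evaluates the theoretical
additive and epistatic variances of this simulated model. Two assumptions are made that
are not stated in the text: recombination frequencies are obtained from map distances with
the Haldane mapping function, and QTLs on different chromosomes recombine at 0.5. Because
the three interacting pairs sit on three different chromosomes, the covariances between
different pairs vanish and only the terms with l = j and m = k remain in equation 6.14.
All names in the code are hypothetical.

```python
import math

# Simulated model of table 6.2: (chromosome, position in cM, additive effect)
QTLS = {
    "Q1": (1, 25.0, -0.7), "Q2": (1, 45.0, 0.9),
    "Q3": (2, 25.0, 0.0), "Q4": (2, 55.0, 0.0),
    "Q5": (3, 25.0, 0.0), "Q6": (3, 55.0, -0.9),
    "Q7": (4, 15.0, -0.9),
}
EPISTASIS = {("Q1", "Q2"): 1.7, ("Q3", "Q4"): 1.7, ("Q5", "Q6"): -1.7}


def recombination(a, b):
    """Recombination frequency between two QTLs (Haldane function, an assumption)."""
    chr_a, pos_a, _ = QTLS[a]
    chr_b, pos_b, _ = QTLS[b]
    if chr_a != chr_b:
        return 0.5
    return 0.5 * (1.0 - math.exp(-2.0 * abs(pos_a - pos_b) / 100.0))


def additive_variance():
    """Equation 6.13: sum of (1 - 2 r_jk) a_j a_k over all ordered pairs (j, k)."""
    return sum(
        (1.0 - 2.0 * recombination(j, k)) * QTLS[j][2] * QTLS[k][2]
        for j in QTLS for k in QTLS
    )


def epistatic_variance():
    """Equation 6.14 for non-overlapping pairs: aa^2 [1 - (1 - 2 r_jk)^2] per pair."""
    total = 0.0
    for (j, k), aa in EPISTASIS.items():
        linkage = 1.0 - 2.0 * recombination(j, k)
        total += (1.0 - linkage ** 2) * aa ** 2
    return total


v_a = additive_variance()
v_i = epistatic_variance()
```

Under these assumptions the repulsion linkage between the first two QTLs lowers `v_a`
below the sum of squared additive effects, and the ratio of `v_i` to `v_a` gives a rough
idea of how much of the genetic variation in this model is epistatic.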


TAB. 6.2 — Positions (cM), additive and epistatic effects of seven putative QTLs.


QTL               Q1      Q2      Q3      Q4      Q5      Q6      Q7
Chromosome        1       1       2       2       3       3       4
Position (cM)     25      45      25      55      25      55      15
Additive effect   −0.7    0.9     0       0       0       −0.9    −0.9

Epistatic effect  Q1 and Q2: 1.7    Q3 and Q4: 1.7    Q5 and Q6: −1.7


In each simulated population with a size of 300, the number of total markers is
far less than population size. As a result, in the first stage of stepwise regression, the
largest P-value for entering variables ($PIN_1$) is set at 0.05 and the smallest P-value
for removing variables ($POUT_1$) is twice $PIN_1$, as is normally adopted in most
stepwise regressions. In the second stage, considering the increasing number of
regression variables, $PIN_2$ is set as the square of $PIN_1$, i.e., $PIN_2 = PIN_1^2 = 0.0025$,
and $POUT_2$ is twice $PIN_2$.
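
The arithmetic behind these settings is simple enough to spell out. The sketch below counts
the regression variables available at each stage for the simulated genome (4 chromosomes
with 11 markers each) and derives the four probability levels; the variable names are
hypothetical.

```python
chromosomes = 4
markers_per_chromosome = 11

n_markers = chromosomes * markers_per_chromosome   # marker variables in the first stage
n_pairs = n_markers * (n_markers - 1) // 2         # marker pairs in the second stage

pin1 = 0.05          # entering level, first stage
pout1 = 2 * pin1     # removing level, first stage
pin2 = pin1 ** 2     # entering level, second stage
pout2 = 2 * pin2     # removing level, second stage
```

With 44 markers there are 946 candidate marker pairs, more than three times the population
size of 300, which is why a stricter entering level is used in the second stage.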

The average LOD score profile and additive effect profile from the
one-dimensional scanning of 100 simulated DH populations are shown in figure 6.1.
$Q_3$, $Q_4$, and $Q_5$ are assumed to have no additive effects (table 6.2). In the
one-dimensional scanning of additive QTL, the average additive effects around the
three QTLs are close to zero. Four peaks on the average LOD score profile represent
the positions of $Q_1$, $Q_2$, $Q_6$ and $Q_7$. $Q_1$ and $Q_2$ are linked in the repulsion phase,
representing the most difficult scenario in QTL mapping studies. The average LOD
scores at the two peaks are low, indicating the low detection powers of the two linked
QTLs. $Q_6$ and $Q_7$ are the only QTLs with additive effects on the third and fourth
chromosomes, respectively. Two peaks around the true positions of the two QTLs
are obvious. LOD scores around the two peaks are high, indicating the high
detection powers of the two QTLs. Effects at the two peaks are close to true additive
effects as defined in table 6.2.


FIG. 6.1 — The average LOD score profile (A) and additive effect profile (B) from the
one-dimensional scanning of 100 simulated DH populations. Horizontal axis:
one-dimensional scanning across 4 chromosomes, step 1 cM.


Results in table 6.3 indicate that in a support interval of 10 cM, detection powers
in one-dimensional scanning are equal to 0.07, 0.12, 0.57, and 0.73 (i.e., the detected
times divided by 100, the number of simulated DH populations) for $Q_1$, $Q_2$, $Q_6$, and
$Q_7$, respectively. $Q_3$ and $Q_4$ are detected in none of the 100 simulated populations.
$Q_5$ is detected in two simulated populations, which should be counted as false
positives. Therefore, the false discovery rate of the QTLs without additive effects is
controlled at a very low level. Detected QTLs that are not located in the 10 cM
support intervals of $Q_1$ to $Q_7$ are counted as false positives, and the number of
false QTLs is indicated at the end of table 6.3.


TAB. 6.3 — Detected times, QTL position, LOD score, and additive effects from 100 simulated
DH populations.


Chr.  QTL  Times  Pos. (cM)  SE of position  LOD score  SE of LOD  Additive effect  SE of effect
1     Q1   7      24.74      3.03            8.50       5.79       −0.9304          0.2972
1     Q2   12     45.27      2.65            8.97       5.16       0.9851           0.2828
2     Q3   0
2     Q4   0
3     Q5   2      23.33      4.71            4.52       2.21       0.6623           0.1379
3     Q6   57     55.47      2.74            8.25       5.19       −0.9509          0.2987
4     Q7   73     14.42      2.73            7.75       3.07       −0.9304          0.1899

False QTL  85

Notes: QTL support interval is 10 cM; population size is 300; true QTL positions and effects
are given in table 6.2. Standard errors (SE) are given after QTL position, LOD score and
additive effect.

The average $LOD_{AA}$ profile and epistatic effect profile from the two-dimensional
scanning of 100 simulated DH populations are shown in figure 6.2. Three clear peaks
are present on the $LOD_{AA}$ profile (figure 6.2A), corresponding to the three
pre-defined epistatic interactions, i.e., between $Q_1$ and $Q_2$, between $Q_3$ and $Q_4$, and
between $Q_5$ and $Q_6$. The three epistatic effects are equal by their absolute values
(table 6.2), therefore explaining the same amount of phenotypic variation. Peaks on
the $LOD_{AA}$ profile have similar heights (i.e., 11.23 for $Q_1 \times Q_2$, 13.76 for $Q_3 \times Q_4$
and 14.90 for $Q_5 \times Q_6$). High $LOD_{AA}$ values around the three peaks indicate the
high detection powers of the three pre-defined di-genic interactions. The epistatic
effects at the peaks are close to true effects as defined in table 6.2 (figure 6.2B).


FIG. 6.2 — The average LOD score profile (A) and epistatic effect profile (B) from the
two-dimensional scanning of 100 simulated DH populations. Vertical axes: LOD score
testing epistasis (A) and di-genic epistatic effect (B).


Detection powers of the three pairs of epistatic QTLs were close to 100% within a
square support interval with an edge length of 10 cM (table 6.4). There were 10 pairs
of epistatic QTLs that were not located within the support intervals, which are
counted as false positives. Of the 95 detected interactions between $Q_1$ and $Q_2$, the
positions of $Q_1$ estimated at 24.79 cM and $Q_2$ estimated at 45.11 cM were close to
their true positions of 25 cM and 45 cM; the mean estimates of two additive effects
were −0.6964 and 0.8427, close to the true effects of −0.7 and 0.9, respectively; the
mean estimate of epistatic effect was 1.6212, close to the true effect of 1.7. Of the 97
detected interactions between $Q_3$ and $Q_4$, $Q_3$ was estimated at 25.10 cM, and $Q_4$ was
estimated at 55.05 cM, close to their true positions of 25 cM and 55 cM; the true
additive effect of both QTLs was 0, so their estimates were also close to 0; the mean
estimate of epistatic effect was 1.6164, close to the true effect of 1.7. Of the 97
detected interactions between $Q_5$ and $Q_6$, $Q_5$ was estimated at 24.79 cM, and $Q_6$ was
estimated at 55.21 cM, close to their true positions of 25 cM and 55 cM; the true additive
effect for $Q_5$ was 0, so its estimate was also close to 0; the mean estimate of additive effect for
$Q_6$ was −0.8281, close to the true effect of −0.9. The average estimate of epistatic
effect between the two QTLs was −1.6106, close to the true effect of −1.7.

The results in tables 6.3, 6.4, and figure 6.2 indicate that the mapping principle
of ICIM is also suitable for the detection of di-genic epistatic QTLs in DH
populations. ICIM is able to effectively detect the epistatic effects between two loci
through genome-wide two-dimensional scanning. ICIM is able to detect not only
epistatic QTLs with both additive and epistatic effects but also epistatic QTLs with
little or no additive effects but large epistatic effects. As far as the DH populations
are concerned, additive effects of individual QTLs can still be accurately estimated
by one-dimensional scanning in the presence of epistasis.


TAB. 6.4 — Detection times of the di-genic epistatic QTLs, and their positions, LOD score,
additive and epistatic effects from 100 simulated DH populations.


Epistatic QTLs   Times  1st pos. (cM)  2nd pos. (cM)  LOD_AA  1st effect  2nd effect  Epistasis
Q1 and Q2        95     24.79          45.11          11.93   −0.6964     0.8427      1.6212
Q3 and Q4        97     25.10          55.05          15.09   −0.0441     0.0304      1.6164
Q5 and Q6        97     24.79          55.21          15.74   0.0168      −0.8281     −1.6106

False epistasis  10

Note: Power analysis is based on 100 simulated DH populations, each of size 300. True
positions and effects of QTL are given in table 6.2. The QTL support interval is a square with
a side length of 10 cM.


6.2 Epistatic QTL Mapping in F2 Populations


6.2.1 The Di-Genic Epistasis Model in F2 Populations


Assume there are two interacting QTLs in an F2 population, i.e., $Q_1$ and $Q_2$. $w_1$ and
$v_1$ are the indicators of genotypes at locus $Q_1$, taking values 1 and 0 for genotype
$Q_1Q_1$, 0 and 1 for genotype $Q_1q_1$, and −1 and 0 for genotype $q_1q_1$; $w_2$ and $v_2$
are the indicators of genotypes at locus $Q_2$, taking similar values as the indicators at
locus $Q_1$. The two QTLs have two additive effects (represented by $a_1$ and $a_2$), and
two dominant effects (represented by $d_1$ and $d_2$). Four epistatic effects can occur
simultaneously in the F2 populations, i.e., additive by additive epistasis (represented
by aa), additive by dominant epistasis (represented by ad), dominant by additive
epistasis (represented by da), and dominant by dominant epistasis (represented by
dd). Using the genotypic indicators and genetic effects defined above, the nine
genotypic values can be represented by equation 6.15.

$$G = \mu + a_1 w_1 + d_1 v_1 + a_2 w_2 + d_2 v_2 + (aa)\, w_1 w_2 + (ad)\, w_1 v_2 + (da)\, v_1 w_2 + (dd)\, v_1 v_2 \tag{6.15}$$

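Equation 6.15 gives the mean performances of nine genotypes at the two interacting
loci in the F2 population. Let $\mu_1, \mu_2, \ldots, \mu_9$ be the nine genotypic values,
and the nine genotypes are arranged in order, i.e., $Q_1Q_1Q_2Q_2$, $Q_1Q_1Q_2q_2$, $Q_1Q_1q_2q_2$,
$Q_1q_1Q_2Q_2$, $Q_1q_1Q_2q_2$, $Q_1q_1q_2q_2$, $q_1q_1Q_2Q_2$, $q_1q_1Q_2q_2$, and $q_1q_1q_2q_2$. Equation 6.16
gives the relationship between the genotypic values and genetic effects.

$$\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \\ \mu_7 \\ \mu_8 \\ \mu_9 \end{pmatrix} =
\begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & -1 & 0 & -1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & -1 & 0 & 0 & 0 & -1 & 0 \\
1 & -1 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 1 & 0 & -1 & 0 & 0 \\
1 & -1 & 0 & -1 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix} \mu \\ a_1 \\ d_1 \\ a_2 \\ d_2 \\ aa \\ ad \\ da \\ dd \end{pmatrix} \tag{6.16}$$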


Given the nine genotypic values, genetic effects can be calculated by
equation 6.17.

$$\begin{pmatrix} \mu \\ a_1 \\ d_1 \\ a_2 \\ d_2 \\ aa \\ ad \\ da \\ dd \end{pmatrix} =
\begin{pmatrix}
0.25 & 0 & 0.25 & 0 & 0 & 0 & 0.25 & 0 & 0.25 \\
0.25 & 0 & 0.25 & 0 & 0 & 0 & -0.25 & 0 & -0.25 \\
-0.25 & 0 & -0.25 & 0.5 & 0 & 0.5 & -0.25 & 0 & -0.25 \\
0.25 & 0 & -0.25 & 0 & 0 & 0 & 0.25 & 0 & -0.25 \\
-0.25 & 0.5 & -0.25 & 0 & 0 & 0 & -0.25 & 0.5 & -0.25 \\
0.25 & 0 & -0.25 & 0 & 0 & 0 & -0.25 & 0 & 0.25 \\
-0.25 & 0.5 & -0.25 & 0 & 0 & 0 & 0.25 & -0.5 & 0.25 \\
-0.25 & 0 & 0.25 & 0.5 & 0 & -0.5 & -0.25 & 0 & 0.25 \\
0.25 & -0.5 & 0.25 & -0.5 & 1 & -0.5 & 0.25 & -0.5 & 0.25
\end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \\ \mu_7 \\ \mu_8 \\ \mu_9 \end{pmatrix} \tag{6.17}$$

It can be seen from equation 6.17 that $\mu$ defined in equation 6.15 is equal to the
average of four homozygous genotypic values, i.e., $\mu_1$, $\mu_3$, $\mu_7$, and $\mu_9$. Homozygous
genotypes are normally acquired by repeated selfing. Therefore, the genetic model
defined by equation 6.15 is sometimes called the $F_\infty$-model (F-infinite model).
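
The relationship between equations 6.16 and 6.17 can be checked numerically. The sketch
below builds the 9 × 9 matrix of equation 6.16 from the genotype indicators and inverts it
with exact rational arithmetic, so that the rows of the inverse can be compared with the
coefficients of equation 6.17. The function names are hypothetical, and the Gauss-Jordan
routine is a generic textbook method rather than anything specific to this chapter.

```python
from fractions import Fraction

# (w, v) indicators for the three genotypes at one locus
CODES = {"QQ": (1, 0), "Qq": (0, 1), "qq": (-1, 0)}
ORDER = ["QQ", "Qq", "qq"]


def design_row(g1, g2):
    """Coefficients of (mu, a1, d1, a2, d2, aa, ad, da, dd) in equation 6.15."""
    w1, v1 = CODES[g1]
    w2, v2 = CODES[g2]
    return [1, w1, v1, w2, v2, w1 * w2, w1 * v2, v1 * w2, v1 * v2]


def design_matrix():
    """Rows follow the genotype order used for equation 6.16."""
    return [design_row(g1, g2) for g1 in ORDER for g2 in ORDER]


def invert(matrix):
    """Gauss-Jordan inversion with exact fractions."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(c == r)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


inverse = invert(design_matrix())
mu_row = inverse[0]   # coefficients that recover mu from the nine genotypic values
dd_row = inverse[8]   # coefficients that recover the dd effect
```

The first row of `inverse` weights only the four double homozygotes, which is the property
behind the name F-infinite model.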


6.2.2 Epistatic QTL Mapping Procedure in F2 Population


Assume two inbred parents $P_1$ and $P_2$ differ at a number of m QTLs, which are
distributed in m intervals flanked by m + 1 markers on one chromosome. Intervals
where no QTLs are located are viewed as having QTLs with effects of zero.
Multiple QTLs located in one marker interval are not considered. The QTL genotype is
$Q_1Q_1Q_2Q_2 \ldots Q_mQ_m$ for $P_1$, and $q_1q_1q_2q_2 \ldots q_mq_m$ for $P_2$. In the F2 population,
$X = (x_1, x_2, \ldots, x_m, x_{m+1})$ and $Z = (z_1, z_2, \ldots, z_m, z_{m+1})$ represent the known
marker types, where x and z take values 1 and 0 for the $P_1$ marker type, 0 and 1 for
the heterozygous marker type, and −1 and 0 for the $P_2$ marker type. $W = (w_1, w_2, \ldots,
w_m)$ and $V = (v_1, v_2, \ldots, v_m)$ represent the unknown QTL genotypes, taking similar
values as the marker type indicators x and z. Additive effects of the m QTLs are
represented by $a_1, a_2, \ldots$, and $a_m$, and dominant effects of the m QTLs are
represented by $d_1, d_2, \ldots$, and $d_m$. Four epistatic effects between QTLs j and k are
denoted by $aa_{jk}$, $ad_{jk}$, $da_{jk}$, and $dd_{jk}$ (j, k = 1, ..., m; j < k).

If the four epistatic effects are all equal to 0, two QTLs do not interact with each
other. Under the additive, dominant and epistasis genetic model, the genotypic
value G of an F2 individual can be written as equation 6.18.

$$G = \mu + G_{AD} + G_{EPI} \tag{6.18}$$

where

$$G_{AD} = \sum_{j=1}^{m} (a_j w_j + d_j v_j)$$

$$G_{EPI} = \sum_{j<k} (aa_{jk} w_j w_k + ad_{jk} w_j v_k + da_{jk} v_j w_k + dd_{jk} v_j v_k)$$

Expected values of additive and dominant genetic effect $G_{AD}$ under various
marker types in equation 6.18 are already given in equations 5.23 and 5.24 in
chapter 5. In theory, it is also possible to derive the expected values of epistatic
genetic effect $G_{EPI}$ under various marker types. However, similar to dominant effects
that can cause second-order interactions between markers, the epistatic effects
between QTLs in F2 populations can cause third- and even fourth-order interactions
between markers. These higher-order interactions between markers, even if they do
exist, are difficult to estimate accurately in genetic populations of limited
size. Therefore, in practical applications, to avoid the overfitting problem arising
from too many marker variables in the linear model of phenotype on markers in F2
populations, the effects of epistatic QTLs are ignored and only one
regression is performed for variable selection.

For epistatic QTL mapping in F2 populations, stepwise regression is conducted
to estimate the linear model as defined by equation 6.19. Equation 6.19 used here is
actually identical to equation 5.28 in chapter 5, which fits only the additive and
dominant effects of individual QTLs. The regression model does not consider the
higher-order interactions between markers caused by the epistatic effects between
QTLs.

$$y_i = \beta_0 + \sum_{j=1}^{m+1} \beta_j x_{ij} + \sum_{j=1}^{m+1} \gamma_j z_{ij} + \sum_{j=1}^{m} \beta_{j,j+1}\, x_{ij} x_{i,j+1} + \sum_{j=1}^{m} \gamma_{j,j+1}\, z_{ij} z_{i,j+1} + \varepsilon_i \tag{6.19}$$

Based on the estimation of the linear model as defined by equation 6.19, the
two-dimensional genome-wide scanning is performed. Assuming that the two current
scanning positions are within the interval of the jth and (j + 1)th markers, and
the interval of the kth and (k + 1)th markers, the phenotype is adjusted by
equation 6.20, using the results of stepwise regression on equation 6.19. Equation 6.20 is
also similar to equation 5.29 in chapter 5 for phenotypic adjustment, except that
both marker intervals are considered here in equation 6.20.

$$\Delta y_i = y_i - \sum_{r \neq j,\, j+1,\, k,\, k+1} \left[\hat{\beta}_r x_{ir} + \hat{\gamma}_r z_{ir}\right] - \sum_{r \neq j,\, k} \left[\hat{\beta}_{r,r+1}\, x_{ir} x_{i,r+1} + \hat{\gamma}_{r,r+1}\, z_{ir} z_{i,r+1}\right] \tag{6.20}$$

Assume that there is one QTL in each of the two currently scanning intervals.
Phenotypes of nine genotypes at the two QTLs follow the normal distribution
$N(\mu_l, \sigma^2)$, where l = 1, 2, ..., 9, and the order of the nine genotypes is the same as
that given in equation 6.16. The null and alternative hypotheses to be tested are,

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_9,$$
$$H_A: \text{at least two of } \mu_1, \ldots, \mu_9 \text{ are not equal to each other}$$

Under the alternative hypothesis $H_A$, the log-likelihood function is given by

$$\ln L_A = \sum_{j=1}^{81} \sum_{i \in S_j} \ln \left[ \sum_{l=1}^{9} \pi_{jl}\, f(\Delta y_i; \mu_l, \sigma^2) \right] \tag{6.21}$$

where $S_j$ is the set of the jth marker group (j = 1, 2, ..., 81), $\pi_{jl}$ (l = 1, 2, ..., 9) is the
frequency of the lth QTL genotype in the jth marker group, and $f(\cdot; \mu_l, \sigma^2)$ is the density
function of the normal distribution $N(\mu_l, \sigma^2)$. The frequencies $\pi_{jl}$ in equation 6.21 can
be calculated from the frequencies given in table 4.9 in chapter 4.

Assume that the frequencies of the three QTL genotypes under a certain marker
type at the first scanning position are denoted by $p_{11}$, $p_{12}$, and $p_{13}$, which are equal
to the frequencies of the last three columns corresponding to the marker type in
table 4.9 in chapter 4 divided by the frequency of the marker type. At the second
scanning position, the frequencies of the three QTL genotypes under a certain
marker type are denoted by $p_{21}$, $p_{22}$, and $p_{23}$, respectively, which are also equal to the
frequencies of the last three columns corresponding to the marker type in table 4.9
divided by the frequency of the marker type. Table 6.5 gives the calculation of the
frequencies of the nine QTL genotypes under particular marker types of the two
currently scanning intervals, i.e., $\pi_{jl}$ included in equation 6.21.


TAB. 6.5 — Calculation of the nine QTL genotypic frequencies at two scanning positions in the
F2 population under one marker type at the two scanning intervals.


Genotypes and frequencies        Genotypes and frequencies at the second scanning position
at the first scanning position   Q2Q2, p21      Q2q2, p22      q2q2, p23
Q1Q1, p11                        p11 p21        p11 p22        p11 p23
Q1q1, p12                        p12 p21        p12 p22        p12 p23
q1q1, p13                        p13 p21        p13 p22        p13 p23

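The EM algorithm is used to calculate the maximum likelihood estimates of the
parameters in the log-likelihood function of equation 6.21. Depending on frequencies
of QTL genotypes contained in various marker groups, initial values in the EM
algorithm can be defined by the genotypes having the highest frequencies, i.e.,

$$\mu_1^{(0)} = \frac{1}{n_1} \sum_{i=1}^{n_1} \Delta y_i, \qquad \mu_2^{(0)} = \frac{1}{n_5} \sum_{i=n_{1:4}+1}^{n_{1:5}} \Delta y_i, \qquad \mu_3^{(0)} = \frac{1}{n_9} \sum_{i=n_{1:8}+1}^{n_{1:9}} \Delta y_i,$$

$$\mu_4^{(0)} = \frac{1}{n_{37}} \sum_{i=n_{1:36}+1}^{n_{1:37}} \Delta y_i, \qquad \mu_5^{(0)} = \frac{1}{n_{41}} \sum_{i=n_{1:40}+1}^{n_{1:41}} \Delta y_i, \qquad \mu_6^{(0)} = \frac{1}{n_{45}} \sum_{i=n_{1:44}+1}^{n_{1:45}} \Delta y_i,$$

$$\mu_7^{(0)} = \frac{1}{n_{73}} \sum_{i=n_{1:72}+1}^{n_{1:73}} \Delta y_i, \qquad \mu_8^{(0)} = \frac{1}{n_{77}} \sum_{i=n_{1:76}+1}^{n_{1:77}} \Delta y_i, \qquad \mu_9^{(0)} = \frac{1}{n_{81}} \sum_{i=n_{1:80}+1}^{n_{1:81}} \Delta y_i,$$

$$\begin{aligned}
\sigma^{2(0)} = \frac{1}{n_1 + n_5 + n_9 + n_{37} + n_{41} + n_{45} + n_{73} + n_{77} + n_{81}} \Bigg[
& \sum_{i=1}^{n_1} \left(\Delta y_i - \mu_1^{(0)}\right)^2 + \sum_{i=n_{1:4}+1}^{n_{1:5}} \left(\Delta y_i - \mu_2^{(0)}\right)^2 + \sum_{i=n_{1:8}+1}^{n_{1:9}} \left(\Delta y_i - \mu_3^{(0)}\right)^2 \\
& + \sum_{i=n_{1:36}+1}^{n_{1:37}} \left(\Delta y_i - \mu_4^{(0)}\right)^2 + \sum_{i=n_{1:40}+1}^{n_{1:41}} \left(\Delta y_i - \mu_5^{(0)}\right)^2 + \sum_{i=n_{1:44}+1}^{n_{1:45}} \left(\Delta y_i - \mu_6^{(0)}\right)^2 \\
& + \sum_{i=n_{1:72}+1}^{n_{1:73}} \left(\Delta y_i - \mu_7^{(0)}\right)^2 + \sum_{i=n_{1:76}+1}^{n_{1:77}} \left(\Delta y_i - \mu_8^{(0)}\right)^2 + \sum_{i=n_{1:80}+1}^{n_{1:81}} \left(\Delta y_i - \mu_9^{(0)}\right)^2 \Bigg]
\end{aligned}$$

In the E-step, the posterior probability that the ith (i = 1, ..., n) individual
belongs to the lth (l = 1, ..., 9) QTL genotype is,

$$w_{il}^{(0)} = \frac{\pi_{jl}\, f\left(\Delta y_i; \mu_l^{(0)}, \sigma^{2(0)}\right)}{\sum_{h=1}^{9} \pi_{jh}\, f\left(\Delta y_i; \mu_h^{(0)}, \sigma^{2(0)}\right)}$$

where j is the number of the marker group in which the ith individual is located. In
the M-step, the parameters to be estimated are updated as,

$$\mu_l^{(1)} = \frac{\sum_{i=1}^{n} w_{il}^{(0)}\, \Delta y_i}{\sum_{i=1}^{n} w_{il}^{(0)}} \quad (l = 1, 2, \ldots, 9), \qquad \sigma^{2(1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{9} w_{il}^{(0)} \left(\Delta y_i - \mu_l^{(1)}\right)^2$$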

The EM algorithm continues until it converges, i.e., until the difference in the
log-likelihood functions between two succeeding iterations falls below a pre-given
precision. Maximum likelihood estimates of the QTL genotypic means and
variance are obtained at the end of the iterations, and the estimated means are substituted into
equation 6.17 to obtain the estimates of genetic effects.

Under the null hypothesis $H_0$, the nine QTL genotypes follow the same normal
distribution $N(\mu_0, \sigma_0^2)$, and the log-likelihood function is,

$$\ln L_0 = \sum_{i=1}^{n} \ln f(\Delta y_i; \mu_0, \sigma_0^2)$$

The parameter estimates for null hypothesis $H_0$ are obtained by solving the
likelihood equation above, i.e.,

$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} \Delta y_i, \qquad \hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} \left(\Delta y_i - \hat{\mu}_0\right)^2$$

Similar to equation 6.11, the difference between the two maximum log-likelihood
functions reflects the significance of the difference between the two hypotheses.
$LOD_A$ thus obtained reflects the significance of the difference among the nine QTL
genotypic means. If they are significantly different from each other, it remains
unknown whether this difference is from the additive and dominant effects of
individual QTLs or from the epistatic effects between the two QTLs. Therefore, another
hypothesis needs to be tested, namely,

$$H_{AA}: aa = ad = da = dd = 0$$

Maximum likelihood estimates under hypothesis $H_{AA}$ are obtained using the
calculation of conditional maxima, i.e.,

$$\ln L_{AA} = \ln L_A - \lambda_1\, aa - \lambda_2\, ad - \lambda_3\, da - \lambda_4\, dd$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the Lagrange multipliers for the four constraints given by
hypothesis $H_{AA}$. The parameter estimation for hypothesis $H_{AA}$ also needs the EM
algorithm. The computational procedure of the E-step is exactly the same as the
E-step for hypothesis $H_A$, but the update of parameters in the M-step requires
solving a set of four linear equations. The elements of the 4 × 4 square matrix C on
the left side of the equations are,
1 1 1 1 
a1 — (0) 


n n (0) n (0) n (0) 
ier Ma Dwg DiW Poi Wig 


1 1 i 1 1 
cı = n 0 n 0 n 0 n 0 ; 
eI ul) Xi uk) Ht uğ Lee uk) 


1 1 2. 1 1 
13 = n 0 n 0 n 0 n 0 ; 
Xi w et uk Viet uk 24 uk) 


1 1 1 1 
G4 = 0) | 1 


T T ; 
n n (0) n (0) n (0)? 
Xi Ma Xi Ws 22/-ı r Dini Wig 


C21 = C12; 
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1 4 1 1 4 1 
C22 = n 0 n 0 n 0 n 0 n 0 n 0 , 
et ul Xi uğ) nı uk Xi uk Da u) ə uk 


1 1 x 1 1 
023 = n 0 f n 0 n 0 n 0 ; 
22x-ı ol) ə uğ x uk ə. uğ 


1 4 x 1 1 4 1 
C24 = n 0) | n 0 n 0 n 0 n 0 n 0 , 
Sei ul) Xi u) 20 uğ Xi wi) Vel uğ — uk) 


C31 = C13; C32 = Coş, 


1 1 4 4 1 1 
033 = n 0 n 0 n 0 n 0 n 0 n 0 , 
22/-1 ul) Vel uğ) Viet uk) zı uk Jai u) : nı uk) 


1 1 4 4 1 1 
C34 = A 0 a 0 m 0 n 0 | n (0) n (0) , 
2 ul) — uğ ə ul 2. uğ Viet Viz Di Vig 


C41 = C14; C42 = C24; C43 = C34; 


1 4 1 4 16 
C44 = (0) 


n n 0 n 0 n 0 n 0 
ie Va 22x-ı uğ) zə u, : ə uk ae uğ 


4 1 4 1 
n 0 n 0 n 0 n 0 
Des uk .. uk ea uk) Viet uk) 


The 4 X 1 vector b on the right-hand side of the equations is, 


1 1 1 | 1 
m (0) n (0) m (0) 1 n (0) 
— Wi DA Wig ə Wiz — Wig 
ib 2 1 | 1 2 1 


Mər əyər əyər a De əsə əya 
b= y tg t ar y + 
D w x Wig Seen wi haan wi a wi, Sey Wy 
1 2 | 1 2 | 4 2 
wow Lae Da eo ee Daw 
| 1 2 f 1 
5207 See D, 


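The unknown parameters of the equations are the products of the Lagrange
multipliers and the error variance, i.e., $\lambda_1\sigma^2$, $\lambda_2\sigma^2$,
$\lambda_3\sigma^2$, and $\lambda_4\sigma^2$. The inverse of the matrix C is calculated
first, and then the solution to the equations can be calculated by,

$$
\begin{pmatrix} \lambda_1\sigma^2 \\ \lambda_2\sigma^2 \\ \lambda_3\sigma^2 \\ \lambda_4\sigma^2 \end{pmatrix}
=
\begin{pmatrix}
c_{11} & c_{12} & c_{13} & c_{14} \\
c_{21} & c_{22} & c_{23} & c_{24} \\
c_{31} & c_{32} & c_{33} & c_{34} \\
c_{41} & c_{42} & c_{43} & c_{44}
\end{pmatrix}^{-1} \mathbf{b}
$$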


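In the M-step of the EM algorithm, mean values of the nine QTL genotypes are
updated by,

$$
\begin{aligned}
\mu_1^{(1)} &= \frac{\sum_{i} w_{i1}^{(0)} \Delta y_i - (\lambda_1\sigma^2) + (\lambda_2\sigma^2) + (\lambda_3\sigma^2) - (\lambda_4\sigma^2)}{\sum_{i} w_{i1}^{(0)}}, \\
\mu_2^{(1)} &= \frac{\sum_{i} w_{i2}^{(0)} \Delta y_i - 2(\lambda_2\sigma^2) + 2(\lambda_4\sigma^2)}{\sum_{i} w_{i2}^{(0)}}, \\
\mu_3^{(1)} &= \frac{\sum_{i} w_{i3}^{(0)} \Delta y_i + (\lambda_1\sigma^2) + (\lambda_2\sigma^2) - (\lambda_3\sigma^2) - (\lambda_4\sigma^2)}{\sum_{i} w_{i3}^{(0)}}, \\
\mu_4^{(1)} &= \frac{\sum_{i} w_{i4}^{(0)} \Delta y_i - 2(\lambda_3\sigma^2) + 2(\lambda_4\sigma^2)}{\sum_{i} w_{i4}^{(0)}}, \\
\mu_5^{(1)} &= \frac{\sum_{i} w_{i5}^{(0)} \Delta y_i - 4(\lambda_4\sigma^2)}{\sum_{i} w_{i5}^{(0)}}, \\
\mu_6^{(1)} &= \frac{\sum_{i} w_{i6}^{(0)} \Delta y_i + 2(\lambda_3\sigma^2) + 2(\lambda_4\sigma^2)}{\sum_{i} w_{i6}^{(0)}}, \\
\mu_7^{(1)} &= \frac{\sum_{i} w_{i7}^{(0)} \Delta y_i + (\lambda_1\sigma^2) - (\lambda_2\sigma^2) + (\lambda_3\sigma^2) - (\lambda_4\sigma^2)}{\sum_{i} w_{i7}^{(0)}}, \\
\mu_8^{(1)} &= \frac{\sum_{i} w_{i8}^{(0)} \Delta y_i + 2(\lambda_2\sigma^2) + 2(\lambda_4\sigma^2)}{\sum_{i} w_{i8}^{(0)}}, \\
\mu_9^{(1)} &= \frac{\sum_{i} w_{i9}^{(0)} \Delta y_i - (\lambda_1\sigma^2) - (\lambda_2\sigma^2) - (\lambda_3\sigma^2) - (\lambda_4\sigma^2)}{\sum_{i} w_{i9}^{(0)}}
\end{aligned}
$$

where all sums run over the individuals $i = 1, \ldots, n$.

The EM algorithm continues until convergence, which gives the maximum likelihood
estimates of the parameters and the maximum of the likelihood function under
hypothesis $H_{AA}$. Defining the statistic $\mathrm{LOD}_{AA}$ similarly to equation 6.12
excludes the additive and dominant effects of both QTLs from the test statistic, which
then reflects only the significance of epistatic effects between the two tested QTLs. In
practical applications, the significance of the difference among the nine genotypes at
two loci can first be tested by $\mathrm{LOD}_A$. Given a significant difference among the
nine genotypes, the significance of epistatic effects is then tested by $\mathrm{LOD}_{AA}$.
Given significant epistatic effects, further tests can be made to distinguish which
epistatic effects are significant and which are not. For this purpose, more hypotheses
have to be built and $\mathrm{LOD}_{AA}$-like statistics have to be constructed. For
simplicity, this book does not further distinguish which of the four epistatic effects,
i.e., aa, ad, da, and dd, are significant. In practice, the significance of the four
epistatic effects can be roughly judged by the magnitude of their estimated values.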
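The constrained M-step described above can be sketched in a few lines. The code below is only an illustration under stated assumptions: the nine genotypes are taken in the order AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb; the rows of `A` express 4·aa, 4·ad, 4·da, and 4·dd as linear combinations of the nine genotypic means; `W[k]` and `S[k]` stand for the sums of E-step weights and of weighted phenotypes for genotype k. The names `A`, `solve`, and `constrained_means` are hypothetical.

```python
# Contrast rows for 4*aa, 4*ad, 4*da, 4*dd (assumed genotype order, see above).
A = [
    [1, 0, -1, 0, 0, 0, -1, 0, 1],
    [-1, 2, -1, 0, 0, 0, 1, -2, 1],
    [-1, 0, 1, 2, 0, -2, -1, 0, 1],
    [1, -2, 1, -2, 4, -2, 1, -2, 1],
]

def solve(C, b):
    """Solve C x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(C, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def constrained_means(W, S):
    """Genotypic means that satisfy aa = ad = da = dd = 0."""
    C = [[sum(A[l][k] * A[j][k] / W[k] for k in range(9)) for j in range(4)]
         for l in range(4)]
    b = [sum(A[l][k] * S[k] / W[k] for k in range(9)) for l in range(4)]
    theta = solve(C, b)                      # products lambda_j * sigma^2
    mu = [(S[k] - sum(A[j][k] * theta[j] for j in range(4))) / W[k]
          for k in range(9)]
    return mu, theta

# Toy input: expected F2 counts for n = 160 and the 9:3:4 genotypic values.
W = [10.0, 20.0, 10.0, 20.0, 40.0, 20.0, 10.0, 20.0, 10.0]
G = [2, 2, 1, 2, 2, 1, 0, 0, 0]
S = [w * g for w, g in zip(W, G)]
mu, theta = constrained_means(W, S)
print([round(m, 4) for m in mu])
```

When the unconstrained means already have no epistasis (for example the 9:3:3:1 values), the vector b is zero, the multipliers vanish, and the means are left unchanged.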


6.2.3 Detection Power of Epistatic QTLs in F2
Populations


Using the positions and effects of the seven QTLs in table 6.2 as the genetic
model, 100 F2 populations, each of 300 individuals, were simulated. Power analysis
was conducted for additive and dominant QTLs by one-dimensional scanning, and
for epistatic QTLs by two-dimensional scanning of ICIM. Results of the power
analysis from one-dimensional scanning for additive and dominant QTLs (table 6.6)
showed that within the 10 cM support intervals, Q1, Q2, Q6, and Q7 had detection
powers of 0.24, 0.35, 0.64, and 0.84 (detection times divided by 100, the number of
simulated F2 populations), respectively, which were higher than those from the
simulated DH populations (table 6.3). Q3, Q4, and Q5 were also detected in a number
of simulated F2 populations. The number of false positive QTLs outside the seven
support intervals was also much higher than that of the DH population (tables 6.3
and 6.6). This is mainly because the same LOD score threshold was used in the
simulated F2 and DH populations. The test statistic from the one-dimensional
scanning has one degree of freedom in DH populations, but two degrees of freedom
in F2 populations. In actual QTL mapping studies, a higher LOD score threshold is
needed for the F2 population in order to keep the false positive rate at a level
similar to that of the DH population.
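The effect of the degrees of freedom on a fixed LOD threshold can be illustrated numerically. The sketch below is an approximation only: it assumes that the likelihood ratio statistic, 2 ln(10) × LOD, asymptotically follows a chi-square distribution with the stated degrees of freedom, it ignores the multiple testing along the genome, and the LOD value of 2.5 is just an example. The helper names `chi2_sf` and `lod_to_p` are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    h = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-h), 1.0                  # Q(1, h)
    else:
        q, a = math.erfc(math.sqrt(h)), 0.5       # Q(1/2, h)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lod_to_p(lod, df):
    """Point-wise p-value of a LOD score under the chi-square approximation."""
    return chi2_sf(2.0 * math.log(10.0) * lod, df)

for df in (1, 2, 3, 4, 8):
    print(df, lod_to_p(2.5, df))
```

For the same LOD score the point-wise p-value grows with the degrees of freedom, which is consistent with the larger number of false positives observed when the DH threshold is reused in F2 populations.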

Of the effect estimates, only Q7 had an additive effect close to the true value of
-0.9 (table 6.2) and a dominant effect close to 0. The other QTLs were estimated
to have dominant effects, which were caused by genetic linkage and epistatic
effects. As defined in table 6.2, Q7 does not interact with other QTLs; the other six
QTLs form three epistatic networks. Due to the presence of epistatic effects, the
genetic effects of individual QTLs may not be properly estimated by
one-dimensional scanning in F2 populations. As will be seen shortly, the
two-dimensional scanning can give a better estimation of the genetic effects of
individual QTLs.

TAB. 6.6 — Detection times, QTL position, LOD score, and genetic effects from the
one-dimensional scanning on 100 simulated F2 populations.

Chr.  QTL  Times  Pos. (cM)  SE    LOD score  SE     Genetic effects
                                                     Additive  Dominance
1     Q1   24     23.17      1.77   7.78       8.60  -0.4144   -0.6233
1     Q2   35     47.09      2.37   9.84      11.28   0.6244   -0.4861
2     Q3   12     25.67      2.39   6.58       6.69  -0.0147   -1.0000
2     Q4   20     54.25      1.64   6.99       8.06  -0.0068   -0.9586
3     Q5    9     25.67      2.45   4.78       5.04  -0.1358    0.7743
3     Q6   64     55.33      2.29  12.99      13.53  -0.9276    0.3416
4     Q7   84     14.82      2.38  10.79      11.21  -0.8785   -0.0102
False QTLs 321

Notes: QTL support interval is 10 cM; population size is 300; true QTL positions and effects
are given in table 6.2. Standard errors (SE) are given after QTL position and LOD score.

Results of the power analysis on epistatic QTLs from the two-dimensional
scanning (table 6.7) show that the detection power within the square support
interval with a side length of 10 cM is 0.39, 0.47, and 0.48 for the three pairs of
epistatic QTLs, respectively, lower than the detection power from the DH
population. There are 128 detected pairs of epistatic QTLs that are not within the
support intervals, which is much higher than the number of false positives from the
DH population (table 6.3). The test statistic for the two-dimensional scanning had a
total of 3 degrees of freedom in the DH population, of which 1 degree of freedom
tests for the epistatic effect; in the F2 population it had a total of 8 degrees of
freedom, of which 4 degrees of freedom test for the epistatic effects. Therefore, a
much higher LOD threshold is required when the F2 population is used for epistatic
QTL mapping.

Of the 39 detected interactions between Q1 and Q2, the positions of Q1 at
23.85 cM and Q2 at 41.79 cM were somewhat different from the true positions of 25
and 45 cM; the mean estimates of the two additive effects were -0.4198 and 0.4299,
respectively, both of which had lower absolute values than the true additive effects
of -0.7 and 0.9; the two dominant effects were not significantly different from 0;
and the mean estimate of the additive by additive epistatic effect was 1.4622, which
was lower than the true effect of 1.7. Compared with the mapping results in DH
populations in table 6.4, there were also large deviations in the estimates of
positions and effects for the other two pairs of epistatic QTLs. Therefore, for the
genetic model defined in table 6.2, which contains both linkage and epistasis, the
DH population has higher detection power, and higher accuracy in the estimates of
QTL positions and effects, than the F2 population. Detection power analysis on
more epistatic models in DH and F2 populations will be given in the next section,
together with the reason for the different detection powers observed in different
models in the two types of populations.


TAB. 6.7 — Detection times of the di-genic epistatic QTLs, and their positions, LOD score, and genetic effects from two-dimensional scanning on
100 simulated F2 populations.

Epistatic QTLs   Times  LOD score  QTL  Position (cM)  Genetic effects
                                                       a        d        aa       ad       da       dd
Q1 and Q2        39     6.93       Q1   23.85          -0.4198   0.0084   1.4622   0.1164  -0.1905   0.0345
                                   Q2   41.79           0.4299  -0.1255
Q3 and Q4        47     8.46       Q3   23.62          -0.003    0.0262   1.3940  -0.0541   0.0515   0.0254
                                   Q4   53.19          -0.0038  -0.0601
Q5 and Q6        48     8.47       Q5   25.00          -0.0275  -0.1316  -1.4571   0.0594   0.0361  -0.0933
                                   Q6   51.67          -0.5723   0.0548
False epistasis  128

Notes: QTL support interval is 10 cM; population size is 300; true QTL positions and effects are given in table 6.2.


6.3 Genetic Analysis and Detection Power of the Most
Common Di-Genic Interactions


6.3.1 Genetic Effects in Di-Genic Interactions 


Assume there are two independent loci (i.e., no linkage), represented by A and
B, where allele A is dominant to allele a at locus A, and allele B is dominant to
allele b at locus B (Zhang et al., 2012). When there is no interaction between the two
loci, there are four phenotypic classes following the segregation ratio 9:3:3:1 in
the F2 population derived from two homozygous parents AABB and aabb. The
genotypes corresponding to the four classes of phenotypes are generally denoted
as A_B_, A_bb, aaB_, and aabb, where A_ denotes the mixed genotypes AA
and Aa, and B_ denotes the mixed genotypes BB and Bb. Assume that
the phenotypic value is 3 for A_B_, 2 for A_bb, 1 for aaB_, and 0 for aabb.
The four phenotypes then have a segregation ratio of 9:3:3:1 in the F2 population
(table 6.8).

When there are interactions between the two loci, fewer phenotypic
classes will be observed (Bernardo, 2010). For example, if A_B_ has the
phenotypic value of 2, A_bb has the phenotypic value of 1, and aaB_ and aabb
have the same phenotypic value of 0, the three classes of phenotypes follow a
segregation ratio of 9:3:4. In addition to the two segregation ratios of 9:3:3:1 and
9:3:4, table 6.8 gives 12 other segregation ratios that are commonly observed in
genetics, together with the phenotypic values taken for the various genotypes. To
facilitate the comparison between populations, the last column shows the
segregation ratios observed in the DH population derived from the same two
parents AABB and aabb.

In F2 populations, two interacting QTLs have two kinds of main effects, i.e.,
additive and dominance, and four kinds of epistatic effects, i.e., additive by
additive, additive by dominance, dominance by additive, and dominance by
dominance (equations 6.15-6.17). In DH populations, two interacting QTLs only have
two main additive effects and one additive by additive epistatic effect (equations
6.1, 6.8, and 6.9). Using equation 6.17, the genetic effects can be calculated from the
nine genotypic means in table 6.8, and the results are given in table 6.9. It can be
seen that for the segregation ratio 9:3:3:1, the two additive effects are equal to 1 and
0.5, the two dominant effects are equal to the respective additive effects, and the
four epistatic effects are equal to 0. The segregation ratio of 9:7 is referred to as
complementary epistasis in genetics. As can be seen from table 6.9, when all genetic
effects are equal, such a segregation ratio appears in the F2 population, and the
corresponding phenotypic trait is controlled by the complementary epistasis of two
unlinked loci. The segregation ratio of 15:1 is called duplicate dominant epistasis.
With this segregation ratio, the two additive and two dominant effects take the same
value, the four epistatic effects take the same value, and the epistatic effects are in
the opposite direction of the additive and dominant effects (table 6.9).
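The calculation of genetic effects from the nine genotypic means can be sketched as follows. Equation 6.17 is not repeated here, so the code assumes the usual parameterization in which the mean and the additive effects are defined from the four homozygous genotypes; this assumption reproduces the 9:3:4 and 15:1 rows of table 6.9. The function name `f2_effects` and the layout of `g` (rows AA, Aa, aa; columns BB, Bb, bb) are hypothetical.

```python
def f2_effects(g):
    """Genetic effects from a 3 x 3 table of genotypic means.

    g[i][j]: row i = AA, Aa, aa at locus A; column j = BB, Bb, bb at locus B.
    """
    mu = (g[0][0] + g[0][2] + g[2][0] + g[2][2]) / 4
    a1 = (g[0][0] + g[0][2] - g[2][0] - g[2][2]) / 4
    a2 = (g[0][0] - g[0][2] + g[2][0] - g[2][2]) / 4
    aa = (g[0][0] - g[0][2] - g[2][0] + g[2][2]) / 4
    d1 = (g[1][0] + g[1][2]) / 2 - mu      # heterozygote at locus A
    d2 = (g[0][1] + g[2][1]) / 2 - mu      # heterozygote at locus B
    ad = (g[0][1] - g[2][1]) / 2 - a1
    da = (g[1][0] - g[1][2]) / 2 - a2
    dd = g[1][1] - mu - d1 - d2
    return {"mu": mu, "a1": a1, "d1": d1, "a2": a2, "d2": d2,
            "aa": aa, "ad": ad, "da": da, "dd": dd}

# Ratio 9:3:4: A_B_ = 2, A_bb = 1, aaB_ = aabb = 0.
e_934 = f2_effects([[2, 2, 1], [2, 2, 1], [0, 0, 0]])
# Ratio 15:1: every genotype except aabb has value 1.
e_151 = f2_effects([[1, 1, 1], [1, 1, 1], [1, 1, 0]])
print(e_934)
print(e_151)
```

For 15:1 the four epistatic effects come out as -0.25 while the additive and dominant effects are 0.25, which matches the description of duplicate dominant epistasis given above.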


TAB. 6.8 — Phenotypic values for 14 segregation ratios commonly observed in F2 populations.

Segregation   Phenotypic values (or genotypic means)                  Segregation
ratio in F2   AABB  AABb  AAbb  AaBB  AaBb  Aabb  aaBB  aaBb  aabb    ratio in DH
9:3:3:1       3     3     2     3     3     2     1     1     0       1:1:1:1
9:3:4         2     2     1     2     2     1     0     0     0       1:1:2
12:1:3        2     2     2     2     2     2     0     0     1       2:1:1
3:9:4         1     1     2     1     1     2     0     0     0       1:1:2
12:3:1        2     2     2     2     2     2     1     1     0       2:1:1
9:7           1     1     0     1     1     0     0     0     0       1:3
3:13          0     0     1     0     0     1     0     0     0       3:1
9:4:3         2     2     1     2     2     1     0     0     1       1:2:1
9:1:6         2     2     0     2     2     0     0     0     1       1:2:1
10:3:3        2     2     1     2     2     1     0     0     2       2:1:1
15:1          1     1     1     1     1     1     1     1     0       3:1
3:12:1        1     1     2     1     1     2     1     1     0       2:1:1
10:6          1     1     0     1     1     0     0     0     1       1:1
6:9:1         1     1     2     1     1     2     2     2     0       1:2:1

Note: The last column is the ratio in a DH population if the same phenotypic values applied.


TAB. 6.9 — Genetic effects of two QTLs showing various segregation ratios in F2 population.

Segregation ratio in F2   μ      a1     d1     a2      d2      aa      ad      da      dd
9:3:3:1                   1.50   1.00   1.00    0.50    0.50    0.00    0.00    0.00    0.00
9:3:4                     0.75   0.75   0.75    0.25    0.25    0.25    0.25    0.25    0.25
12:1:3                    1.25   0.75   0.75   -0.25   -0.25    0.25    0.25    0.25    0.25
3:9:4                     0.75   0.75   0.75   -0.25   -0.25   -0.25   -0.25   -0.25   -0.25
12:3:1                    1.25   0.75   0.75    0.25    0.25   -0.25   -0.25   -0.25   -0.25
9:7                       0.25   0.25   0.25    0.25    0.25    0.25    0.25    0.25    0.25
3:13                      0.25   0.25   0.25   -0.25   -0.25   -0.25   -0.25   -0.25   -0.25
9:4:3                     1.00   0.50   0.50    0.00    0.00    0.50    0.50    0.50    0.50
9:1:6                     0.75   0.25   0.25    0.25    0.25    0.75    0.75    0.75    0.75
10:3:3                    1.25   0.25   0.25   -0.25   -0.25    0.75    0.75    0.75    0.75
15:1                      0.75   0.25   0.25    0.25    0.25   -0.25   -0.25   -0.25   -0.25
3:12:1                    1.00   0.50   0.50    0.00    0.00   -0.50   -0.50   -0.50   -0.50
10:6                      0.50   0.00   0.00    0.00    0.00    0.50    0.50    0.50    0.50
6:9:1                     1.25   0.25   0.25    0.25    0.25   -0.75   -0.75   -0.75   -0.75


The F2 population contains all nine possible genotypes at two genetic loci, and
therefore the additive and dominant effects at each locus can be estimated, as well as
the four epistatic effects between the two loci. The DH population (or any other
permanent population) contains only the four homozygous genotypes, and
therefore only the additive effect at each locus can be estimated, as well as the
additive by additive epistatic effect between the two loci.


6.3.2 Decomposition of Genetic Variance in the Presence
of Di-Genic Epistasis


Variance is a population parameter that depends not only on the genotypic
means in the population but also on the frequencies of the genotypes that are present
in the population. The segregation ratio of 9:3:4 in one F2 population is used here
as an example to illustrate the decomposition of genetic variance between two
genetic loci. Assume there is no linkage between the two loci and the four allele
frequencies are equal to 0.5. The frequencies of the nine genotypes and the
marginal frequencies at the two loci are shown in table 6.10. The marginal frequency
of genotype AA at locus A is defined as the sum of the frequencies of the three
genotypes at locus B, i.e., $f_{1.} = f_{11} + f_{12} + f_{13} = 0.25$. The other marginal
frequencies are calculated in the same way.

Under the assumption of no linkage between locus A and locus B, the frequency
of each combined genotype in table 6.10 is equal to the product of the two
corresponding marginal frequencies. More generally, if a population is in
Hardy-Weinberg equilibrium and there is no linkage disequilibrium between the two
loci, the frequencies of combined genotypes at the two loci are also equal to the
products of marginal frequencies. Therefore, the methods introduced below for
calculating and decomposing the total genetic variance are also applicable to any
population in Hardy-Weinberg equilibrium without linkage disequilibrium.

TAB. 6.10 — Expected frequencies of various genotypes in the F2 population at two unlinked
loci A and B.

Locus A              Locus B                                        Marginal frequency
                     BB            Bb            bb                 at locus A
AA                   f11 = 0.0625  f12 = 0.125   f13 = 0.0625       f1. = 0.25
Aa                   f21 = 0.125   f22 = 0.25    f23 = 0.125        f2. = 0.5
aa                   f31 = 0.0625  f32 = 0.125   f33 = 0.0625       f3. = 0.25
Marginal frequency   f.1 = 0.25    f.2 = 0.5     f.3 = 0.25
at locus B

Phenotypic means of the nine combined genotypes at loci A and B can be
arranged in a two-way table (table 6.11). Population mean (denoted by $G_{..}$) and
total genetic variance (denoted by $V_G$) are,

$$G_{..} = \sum_{i,j=1,2,3} f_{ij} G_{ij} = 1.3125 \tag{6.22}$$

$$V_G = \sum_{i,j=1,2,3} f_{ij} (G_{ij} - G_{..})^2 = 0.7148 \tag{6.23}$$


For the two-way table 6.11, the weighted average across rows or across columns
can be calculated in the same way as in the two-way ANOVA. Taking the row in
which genotype AA is located and the column in which genotype BB is located as
examples,

$$G_{1.} = f_{.1} G_{11} + f_{.2} G_{12} + f_{.3} G_{13} = 1.75,$$
$$G_{.1} = f_{1.} G_{11} + f_{2.} G_{21} + f_{3.} G_{31} = 1.5$$


In calculating total genetic variance by equation 6.23, the deviation of each
genotypic mean (i.e., $G_{ij}$) from the population mean (i.e., $G_{..}$; equation 6.22)
can be orthogonally decomposed as follows.

$$G_{ij} - G_{..} = (G_{i.} - G_{..}) + (G_{.j} - G_{..}) + (G_{ij} - G_{i.} - G_{.j} + G_{..}); \quad i = 1,2,3;\ j = 1,2,3 \tag{6.24}$$


It can be seen from equation 6.24 that the deviation $G_{ij} - G_{..}$ is
decomposed into three parts, i.e., the main effect at locus A, the main effect at locus
B, and the interaction effect between the two loci. It can be shown that the weighted
average of each type of effect is equal to 0; the covariance between any two types of
effects is also equal to 0. Such a decomposition is called orthogonal in statistics. By
orthogonal decomposition, total genetic variance given by equation 6.23 can be
further decomposed into three parts as well, i.e., variance at locus A, variance at
locus B, and epistatic variance between the two loci.

TAB. 6.11 — Means of the nine combined genotypes at two unlinked loci showing the 9:3:4
segregation ratio, and calculation of genetic variance in the F2 population.

Locus A              Locus B                          Marginal       Marginal   Variance
                     BB         Bb         bb         mean           effect     at locus A
AA                   G11 = 2    G12 = 2    G13 = 1    G1. = 1.75      0.4375    0.5742
Aa                   G21 = 2    G22 = 2    G23 = 1    G2. = 1.75      0.4375
aa                   G31 = 0    G32 = 0    G33 = 0    G3. = 0        -1.3125
Marginal mean        G.1 = 1.5  G.2 = 1.5  G.3 = 0.75 G.. = 1.3125
Marginal effect      0.1875     0.1875     -0.5625
Variance at locus B  0.1055


Marginal effects of the three genotypes at locus A, i.e., $(G_{i.} - G_{..})$ (i = 1, 2, 3),
are equal to 0.4375, 0.4375, and -1.3125, respectively; those at locus B, i.e.,
$(G_{.j} - G_{..})$ (j = 1, 2, 3), are equal to 0.1875, 0.1875, and -0.5625, respectively.
From the marginal effects, the genetic variances at locus A and locus B can be
obtained as follows.

$$V_A = \sum_{i=1,2,3} f_{i.} (G_{i.} - G_{..})^2 = 0.5742 \tag{6.25}$$

$$V_B = \sum_{j=1,2,3} f_{.j} (G_{.j} - G_{..})^2 = 0.1055 \tag{6.26}$$


Denote the interaction term in equation 6.24 (also known as the epistatic
deviation in quantitative genetics) by $I_{ij}$, i.e.,

$$I_{ij} = G_{ij} - G_{i.} - G_{.j} + G_{..} \tag{6.27}$$

Given in table 6.12 are the epistatic deviations of the nine genotypes in the F2
population, from which the epistatic variance can be obtained, as given in
equation 6.28.

$$V_I = \sum_{i,j=1,2,3} f_{ij} I_{ij}^2 = 0.0352 \tag{6.28}$$

TAB. 6.12 — Calculation of the epistatic deviations and epistatic variance in the F2 population
for two unlinked loci showing the 9:3:4 segregation ratio.

Locus A   Locus B
          BB              Bb              bb
AA        I11 = 0.0625    I12 = 0.0625    I13 = -0.1875
Aa        I21 = 0.0625    I22 = 0.0625    I23 = -0.1875
aa        I31 = -0.1875   I32 = -0.1875   I33 = 0.5625
Epistatic variance: $V_I = \sum_{i,j=1,2,3} f_{ij} I_{ij}^2 = 0.0352$ (see table 6.10 for genotypic
frequencies)

The total genetic variance given in equation 6.23 is equal to the sum of
the three genetic variances given in equations 6.25, 6.26, and 6.28. According to the
theory of classical quantitative genetics, marginal effects of the three genotypes at
locus A or locus B in the F2 population can be further decomposed into the sum of
breeding values and dominant deviations; genetic variance at each locus can be
further decomposed into two components, i.e., additive variance (defined by the
additive breeding values) and non-additive (or dominant) variance (defined by the
non-additive deviations). Using the relationship between the marginal means and
the overall population mean, i.e., $G_{..} = \frac{1}{4} G_{1.} + \frac{1}{2} G_{2.} + \frac{1}{4} G_{3.}$,
it is not difficult to show that,

$$
\begin{aligned}
G_{1.} - G_{..} &= \left\{ \tfrac{1}{2}(G_{1.} - G_{3.}) \right\} + \left\{ -\tfrac{1}{2}\left[ G_{2.} - \tfrac{1}{2}(G_{1.} + G_{3.}) \right] \right\} \\
G_{2.} - G_{..} &= \left\{ 0 \right\} + \left\{ \tfrac{1}{2}\left[ G_{2.} - \tfrac{1}{2}(G_{1.} + G_{3.}) \right] \right\} \\
G_{3.} - G_{..} &= \left\{ -\tfrac{1}{2}(G_{1.} - G_{3.}) \right\} + \left\{ -\tfrac{1}{2}\left[ G_{2.} - \tfrac{1}{2}(G_{1.} + G_{3.}) \right] \right\}
\end{aligned}
\tag{6.29}
$$

In fact, equation 6.29 gives the orthogonal decomposition at locus A of the
deviation of each genotypic mean from the overall mean of the F2 population. The
effects given in the two pairs of curly brackets are called the breeding values and
dominant deviations for the three genotypes at locus A. It can be shown that the
weighted mean of the three breeding values is equal to 0, as is the weighted mean of
the three dominant deviations; the covariance between the breeding values and
dominant deviations is also equal to 0. In contrast to the F∞ model as defined by
equations 6.15-6.17, the decomposition defined by equation 6.29 is known as the F2
model in quantitative genetics. Population mean (i.e., $G_{..}$) included in equations
6.22-6.29 depends on the genotypic frequencies and is therefore a
population-dependent parameter. It is easy to see from equation 6.29 that
$\frac{1}{2}(G_{1.} - G_{3.})$ is actually the additive effect (also known as the breeding
value) and $G_{2.} - \frac{1}{2}(G_{1.} + G_{3.})$ is the dominant effect (also known as the
dominant deviation) at locus A, both calculated from the marginal genotypic means
at locus A. The additive and dominant variances at locus A in the F2 population are
given in equations 6.30 and 6.31, respectively.

$$V_{A(A)} = \frac{1}{2} \left[ \frac{1}{2} (G_{1.} - G_{3.}) \right]^2 = 0.3828 \tag{6.30}$$

$$V_{D(A)} = \frac{1}{4} \left[ G_{2.} - \frac{1}{2} (G_{1.} + G_{3.}) \right]^2 = 0.1914 \tag{6.31}$$

Here the locus is named in parentheses in the subscript. The sum of the additive
variance obtained in equation 6.30 and the dominant variance obtained in
equation 6.31 is equal to the genetic variance at locus A, obtained in equation 6.25.
For locus B, the additive and dominant variances can be calculated in a similar
way, i.e.,

$$V_{A(B)} = \frac{1}{2} \left[ \frac{1}{2} (G_{.1} - G_{.3}) \right]^2 = 0.0703 \tag{6.32}$$

$$V_{D(B)} = \frac{1}{4} \left[ G_{.2} - \frac{1}{2} (G_{.1} + G_{.3}) \right]^2 = 0.0352 \tag{6.33}$$

Combining the two additive variances obtained from equations 6.30 and 6.32
gives the total additive variance from the two loci; combining the two dominant
variances obtained from equations 6.31 and 6.33 gives the total dominant variance
from the two loci. Given the error variance $V_\varepsilon$, the broad-sense heritability
of the population can be calculated as

$$H^2 = \frac{V_G}{V_G + V_\varepsilon} \tag{6.34}$$

Genetic variation caused by the epistatic effects can be used to measure the
relative importance of epistasis between two loci. In table 6.13, the segregation
ratios are ordered by the proportion of epistatic variance, i.e., $V_I / V_G$, from the
smallest to the largest in the F2 population. The DH population contains only the
additive by additive epistatic effect, and the order of segregation ratios may not
be identical to the order in the F2 population.

In DH populations without segregation distortion, frequencies of the four
homozygous genotypes at two loci are equal to 0.25. Following the notation of loci
and genotypic values in table 6.11, phenotypic means of the four homozygous
genotypes can be expressed as,

$$
\begin{aligned}
G_{11} &= \mu + a_1 + a_2 + aa, & G_{13} &= \mu + a_1 - a_2 - aa, \\
G_{31} &= \mu - a_1 + a_2 - aa, & G_{33} &= \mu - a_1 - a_2 + aa
\end{aligned}
\tag{6.35}
$$

For the population consisting of four homozygous genotypes at loci A and B,
each with a frequency of 0.25, the decomposition given in equation 6.35 is also
orthogonal. That is to say, the weighted average of effects $a_1$ and $-a_1$ for the two
homozygous genotypes at locus A is equal to 0; the weighted average of effects $a_2$
and $-a_2$ for the two homozygous genotypes at locus B is equal to 0; the weighted
average of epistatic effects aa, -aa, -aa, and aa for the four homozygous genotypes is
equal to 0; and the covariance between any two kinds of effects is equal to 0.
Therefore, the total additive variance in the population is equal to the sum of the
additive variances at the two loci, and the epistatic variance in the population is
equal to the square of the additive by additive epistatic effect. Taking the
segregation ratio 9:3:4 as an example, additive effects at the two loci are 0.75 and
0.25 (table 6.9), respectively, and the additive variance = 0.75² + 0.25² = 0.625
(DH population in table 6.13); the additive by additive epistatic effect is 0.25
(table 6.9), and the epistatic variance = 0.25² = 0.0625 (DH population in
table 6.13).

TAB. 6.13 — Genetic variance components for the 14 segregation ratios at two loci in F2 and
DH populations.

Ratio in F2   F2 population                               DH population
population    V_A     V_D     V_I     V_G     V_I/V_G     V_A     V_I     V_G     V_I/V_G
9:3:3:1       0.6250  0.3125  0.0000  0.9375  0.0000      1.2500  0.0000  1.2500  0.0000
9:3:4         0.4531  0.2266  0.0352  0.7148  0.0492      0.6250  0.0625  0.6875  0.0909
12:1:3        0.3906  0.1953  0.0352  0.6211  0.0566      0.6250  0.0625  0.6875  0.0909
3:9:4         0.2656  0.1328  0.0352  0.4336  0.0811      0.6250  0.0625  0.6875  0.0909
12:3:1        0.2031  0.1016  0.0352  0.3398  0.1034      0.6250  0.0625  0.6875  0.0909
9:7           0.1406  0.0703  0.0352  0.2461  0.1429      0.1250  0.0625  0.1875  0.3333
3:13          0.0781  0.0391  0.0352  0.1523  0.2308      0.1250  0.0625  0.1875  0.3333
9:4:3         0.3125  0.1563  0.1406  0.6094  0.2308      0.2500  0.2500  0.5000  0.5000
9:1:6         0.3906  0.1953  0.3164  0.9023  0.3506      0.1250  0.5625  0.6875  0.8182
10:3:3        0.2031  0.1016  0.3164  0.6211  0.5094      0.1250  0.5625  0.6875  0.8182
15:1          0.0156  0.0078  0.0352  0.0586  0.6000      0.1250  0.0625  0.1875  0.3333
3:12:1        0.0625  0.0313  0.1406  0.2344  0.6000      0.2500  0.2500  0.5000  0.5000
10:6          0.0625  0.0313  0.1406  0.2344  0.6000      0.0000  0.2500  0.2500  1.0000
6:9:1         0.0156  0.0078  0.3164  0.3398  0.9310      0.1250  0.5625  0.6875  0.8182
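The DH calculation in equation 6.35 is simple enough to check directly. The sketch below solves the four equations for the mean and the three effects and then forms the variance components; the function name `dh_components` is hypothetical, and the four genotype frequencies are assumed to be 0.25 each.

```python
def dh_components(g11, g13, g31, g33):
    """Effects and variance components from the four homozygous means.

    g11 = AABB, g13 = AAbb, g31 = aaBB, g33 = aabb (each at frequency 0.25).
    """
    mu = (g11 + g13 + g31 + g33) / 4.0
    a1 = (g11 + g13 - g31 - g33) / 4.0
    a2 = (g11 - g13 + g31 - g33) / 4.0
    aa = (g11 - g13 - g31 + g33) / 4.0
    v_a = a1 ** 2 + a2 ** 2          # additive variance
    v_i = aa ** 2                    # additive by additive epistatic variance
    return {"mu": mu, "a1": a1, "a2": a2, "aa": aa,
            "V_A": v_a, "V_I": v_i, "V_G": v_a + v_i}

# Ratio 9:3:4: AABB = 2, AAbb = 1, aaBB = 0, aabb = 0.
dh_934 = dh_components(2.0, 1.0, 0.0, 0.0)
print(dh_934, dh_934["V_I"] / dh_934["V_G"])
```

The total genetic variance obtained this way equals the plain variance of the four genotypic means, which is a quick consistency check on the orthogonality stated above.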

In the F2 population, assuming there is no segregation distortion, i.e., the
frequencies of the nine genotypes are as shown in table 6.10, the additive variance of
the population as calculated by equations 6.30 and 6.32 also includes the dominant
and epistatic effects. As can be seen from table 6.9, the two segregation ratios 9:3:4
and 12:3:1 have the same additive and dominant effects, but their additive and
dominant variances as given in table 6.13 are quite different. For other bi-parental
populations, for cases where genotypic frequencies deviate from the expected
values, or for cases where there is linkage between the two loci, the total genetic
variance can still be calculated by equations 6.22 and 6.23. However, the orthogonal
decomposition of genetic variance can become very difficult, or may not exist at all.

Decomposition and estimation of additive variance, dominant variance, and
epistatic variance are core elements of classical quantitative genetics. In QTL
mapping studies, decomposition and estimation of the variance components are
still of theoretical value. However, from an applied point of view, they may not be
the most important objectives. In contrast, the phenotypic means of the genotypes
as given in table 6.8, and the accurate estimation of the genetic effects as given in
table 6.9, may be more important. From the phenotypic means of the genotypes as
given in table 6.8, the genotype with the highest value for breeding can be
determined. By developing molecular markers that are closely linked to locus A and
locus B, marker-assisted selection can be carried out in breeding populations. From
the genetic effects in table 6.9, the mean performance of genotypes at more loci can
be predicted, and thus the optimal genotype can be identified. If the purpose of QTL
mapping is to identify the favorable allele at each locus, or the optimal genotype at
different loci, the purpose can be achieved by accurately estimating the genotypic
values and genetic effects. In this sense, the accurate estimation of genotypic values
and genetic effects is particularly important in QTL mapping studies.


6.3.3 Power Simulation of Epistatic QTL Mapping 


Assume one genome consists of 4 chromosomes, each 140 cM in length and evenly
covered by 15 codominant markers, i.e., the marker spacing is 10 cM in the
genome. Two interacting QTLs are located at 25 cM on chromosome 1 and 55 cM
on chromosome 2. Each trait was assumed to represent one of the 14 segregation
ratios in the simulation. The heritability was set at 0.2 for all traits (or equally, all
segregation ratios). Four population sizes were considered, i.e., 100, 200, 300, and
400. A total of 56 simulation scenarios (14 traits × 4 population sizes) were
performed, and each scenario was repeated 100 times. Both one-dimensional and
two-dimensional scanning were conducted with the QTL IciMapping software (Meng
et al., 2015) on the simulated DH and F2 populations. The step size for additive and
dominant QTL mapping in one-dimensional scanning was set to 1 cM, the LOD
score threshold was set to 2.5, and the probabilities of variables entering and leaving
the linear model during stepwise regression were set to 0.01 and 0.02, respectively.
The length of the support interval used in counting the detection power of additive
and dominant QTLs was 10 cM. The step size for epistatic QTL mapping in
two-dimensional scanning was set to 5 cM, the LOD score threshold was set to 5.0,
and the probabilities of variables entering and leaving the linear model during
stepwise regression were set to 0.001 and 0.002, respectively. The side length of the
support squares used in counting the detection power of epistatic QTLs was 20 cM.
Based on the mapping results from 100 simulated populations, the detection power
was counted, and the averages of estimated positions and effects were calculated.
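To make the phenotype part of such a simulation concrete, the sketch below draws the genotypes of two unlinked QTLs in an F2 population and adds a normal error whose variance is chosen to give the target heritability. It is a reduced illustration only: markers, linkage maps, and the mapping step performed by QTL IciMapping are not modelled, and the names `genetic_variance` and `simulate_f2` are hypothetical.

```python
import random

F2_FREQ = (0.25, 0.5, 0.25)    # frequencies of the three genotypes at one locus

def genetic_variance(G):
    """Total genetic variance of a 3 x 3 table of genotypic means in F2."""
    cells = [(F2_FREQ[i] * F2_FREQ[j], G[i][j]) for i in range(3) for j in range(3)]
    mean = sum(f * g for f, g in cells)
    return sum(f * (g - mean) ** 2 for f, g in cells)

def simulate_f2(G, n, h2, seed=0):
    """Simulate n individuals; returns (genotype A, genotype B, phenotype) and V_e."""
    rng = random.Random(seed)
    v_e = genetic_variance(G) * (1.0 - h2) / h2    # error variance for target h2
    sd = v_e ** 0.5
    data = []
    for _ in range(n):
        i = rng.choices(range(3), weights=F2_FREQ)[0]   # unlinked loci are
        j = rng.choices(range(3), weights=F2_FREQ)[0]   # drawn independently
        data.append((i, j, G[i][j] + rng.gauss(0.0, sd)))
    return data, v_e

# Ratio 9:3:4 with heritability 0.2 and population size 300.
G_934 = [[2, 2, 1], [2, 2, 1], [0, 0, 0]]
data, v_e = simulate_f2(G_934, n=300, h2=0.2, seed=1)
print(len(data), v_e)
```

Repeating such a simulation 100 times per scenario and counting the runs in which the QTL pair is detected inside the support square gives the detection power described above.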

Figure 6.3 shows the epistatic QTLs detected in 100 simulated DH populations
with a size of 200. The genome is represented by a colored ring, where the four
chromosomes are separated by colors. Each dotted line connects the two interacting
loci on the genome; more lines indicate that the epistasis has been detected more
often. Epistatic effects are equal to zero in the segregation ratio 9:3:3:1, and the three
interactions detected in 100 simulations were false positives. This indicates that the
threshold LOD of 5.0 keeps the false positives of epistatic QTLs at a low level. From
table 6.13, it can be seen that in the DH population, the epistatic variances from
ratio 12:1:3 and ratio 3:9:4 both accounted for less than 10% of the total genetic
variance, and their detection times were low (figure 6.3). The epistatic variances
from ratio 9:7 and ratio 3:13 both accounted for one-third of the total genetic
variance (table 6.13), and their detection times were high (figure 6.3). The epistatic
variance from the segregation ratio of 9:1:6 accounted for more than 80% of the
total genetic variance (table 6.13) and had the highest number of detection times in
figure 6.3.

FIG. 6.3 — Epistatic QTL mapping in 100 simulated DH populations (panels: ratios
9:3:3:1, 12:1:3, 3:9:4, 9:7, 3:13, and 9:1:6). Notes: Size of each simulated population is
200, and heritability in the broad sense is 0.2. The genome is represented by a colored
ring, with four chromosomes separated by colors. Each dotted line connects the two
interacting loci on the genome, and the number above the line is the LOD score
in testing the significance of additive by additive epistasis.

Figure 6.4 gives the detection power of epistatic QTLs under four sizes of F2 and
DH populations. Segregation ratio 9:3:3:1 contains only the additive and dominant
effects, and the two QTLs cannot be detected by the epistatic mapping in simulated
F2 populations. The detection power of epistatic QTLs for the other segregation
ratios increases with population size. For the four sizes of F2 populations, detection
powers are 3%, 18%, 56%, and 72% for ratio 9:3:4, and 59%, 97%, 100%, and 100%
for ratio 9:7. Significant differences in detection power were observed among the
segregation ratios. By comparing the genetic variances given in table 6.13 with the
detection powers shown in figure 6.4A, it can be concluded that the greater the
proportion of the epistatic variance in the total genetic variance, the higher the
detection power. For example, 6:9:1 has the largest proportion of epistatic variance
among the 14 segregation ratios, and its detection power reaches 97% in F2
populations even with a size of 100; 9:3:4 has the smallest nonzero proportion of
epistatic variance, and its detection power was only 72% even in F2 populations
with a size of 400.

A similar trend in the detection power can be observed in the simulated DH
populations (figure 6.4B). For segregation ratios 9:3:4, 12:1:3, and 3:9:4, epistatic
variance is less than 10% of the total genetic variance (table 6.13). Even in simulated
populations with 400 DH lines, their detection power remains very low. In DH
populations, 10:6 had the highest epistatic variance in the 14 segregation ratios and
therefore had the highest detection power.

FIG. 6.4 - Detection power of di-genic epistatic QTLs estimated from 100 simulated F2
(A) and DH (B) populations. Notes: Each type of population has sizes 100, 200, 300, and 400.
For both F2 and DH populations, the two epistatic QTLs explain 20% of the phenotypic
variance.

The fact that the segregation ratio 9:3:3:1 cannot be detected in two-dimensional 
scanning does not mean that the two QTLs are undetectable. In fact, they can be 
easily detected by one-dimensional scanning (figures 6.5 and 6.6). In ratio 9:3:3:1, 
both additive and dominant effects are equal to 1 at the first QTL, and equal to 0.5 
at the second QTL (table 6.9). When the F2 population size was 100, the detection
power of the first QTL was close to 100%; when the F2 population size was 200, the
detection power of the second QTL was close to 95% (figure 6.5). The detection 
power of each of the two epistatic QTLs in the one-dimensional scanning depended 
on the proportion of additive and dominant variance to the total genetic variance. 
Ratio 6:9:1 has the highest epistatic variance, accounting for 93.1% of the genetic 
variance, and the additive and dominant variance of the two QTLs accounts for less 
than 7% of the total genetic variance (table 6.13). Thus, the two QTLs had the 
lowest detection powers during the one-dimensional scanning (figure 6.5). In DH 
populations, all genetic variance in the segregation ratio of 10:6 comes from the 
additive by additive epistatic effect, with the additive variance equal to 0. Thus, the 
detection powers of the two QTLs were close to 0 during the one-dimensional 
scanning (figure 6.6). 

FIG. 6.5 - Detection power of each of the two di-genic epistatic QTLs in 100 simulated F2
populations in one-dimensional scanning. Notes: Each simulated population has sizes 100,
200, 300, and 400. The two epistatic QTLs explain 20% of the phenotypic variance in F2
populations.

FIG. 6.6 - Detection powers of the two di-genic epistatic QTLs in 100 simulated DH
populations in one-dimensional scanning. Notes: Each simulated population has sizes 100,
200, 300, and 400. The two epistatic QTLs explain 20% of the phenotypic variance in DH
populations.

Using the simulated F2 populations of size 200 as an example, the detection
power of epistatic QTLs and the averaged estimates of two QTL positions and eight
genetic effects are shown in table 6.14. For the segregation ratio of 9:3:3:1, no
significant interactions were detected, so the ratio does not appear in table 6.14. For
segregation ratios with lower detection powers, the estimates of QTL positions and
genetic effects were different from their true values to some extent. For example, for
ratio 3:9:4, the detection power was only 17%, and the averages of the estimates of
two QTL positions were 27.35 cM and 57.06 cM, while the true positions are 25 cM
and 55 cM, respectively. The absolute values of the estimates of additive and
dominant effects were lower than the absolute values of the true values in table 6.9,
while the absolute values of the estimates of various epistatic effects were higher
than the absolute values of the true values in table 6.9. For segregation ratios with
high detection powers, estimates of both the QTL positions and genetic effects were
almost unbiased. For example, for ratio 3:13, detection power was 96%, and the
averages of the estimates of two QTL positions were 25.99 cM and 54.69 cM, which
were very close to the true positions 25 cM and 55 cM, respectively; the averages of
the estimates of additive and dominant effects were 0.23 and 0.22, respectively, for
the first QTL, while the true effects were both equal to 0.25; the averages of the
estimates of additive and dominant effects were -0.22 and -0.19, respectively, for
the second QTL, while the true effects were both equal to -0.25; the averages of the
four estimates of epistatic effects between the two QTLs were -0.24, -0.25, -0.25,
and -0.24, respectively, while the true effects were all equal to -0.25.

TAB. 6.14 - Detection power of epistatic QTLs and the averaged estimates of two QTL
positions and eight genetic effects from 100 simulated F2 populations of size 200.

Ratio in F2   Power   Position (cM)       Genetic effect
population    (%)     1st QTL  2nd QTL    a1     d1     a2     d2     aa     ad     da     dd
9:3:4          27     23.52    54.63      0.36   0.20   0.12   0.04   0.28   0.32   0.35   0.40
12:1:3         35     23.43    56.57      0.53   0.39  -0.31  -0.27   0.33   0.32   0.35   0.36
3:9:4          17     27.35    57.06      0.56   0.52   0.04   0.11   0.33   0.29   0.28   0.29
12:3:1         30     28.00    53.50      0.59   0.62   0.25   0.35   0.28   0.30   0.29   0.50
9:7            97     24.64    54.69      0.21   0.15   0.23   0.19   0.25   0.26   0.26   0.25
3:13           96     25.99    54.69      0.23   0.22  -0.22  -0.19  -0.24  -0.25  -0.25  -0.24
9:4:3          99     24.80    55.81      0.47   0.39  -0.04  -0.07   0.48   0.48   0.51   0.51
9:1:6         100     25.00    54.95      0.20   0.17   0.23   0.17   0.72   0.74   0.73   0.72
10:3:3        100     25.00    55.05      0.22   0.19  -0.25  -0.25   0.73   0.74   0.72   0.73
15:1           96     25.47    55.47      0.22   0.23   0.23   0.23   0.25   0.23   0.25   0.26
3:12:1        100     25.85    55.50      0.45   0.47   0.03   0.04   0.49   0.50   0.50   0.51
10:6          100     25.20    54.70      0.03   0.03   0.02   0.04   0.47   0.47   0.48   0.49
6:9:1         100     24.90    55.05      0.26   0.22   0.23   0.27   0.73   0.74   0.70   0.74

Note: proportion of total genetic variance in phenotypic variance (or heritability in the broad sense) is equal to 0.2.


6.3.4 Issues in Epistatic QTL Mapping 


ICIM is an efficient method for mapping both the additive and dominant QTLs
and epistatic QTLs in bi-parental genetic populations. When performing the
epistatic QTL mapping, ICIM gives two LOD scores, i.e., $\mathrm{LOD}_A$ and $\mathrm{LOD}_{AA}$.
$\mathrm{LOD}_A$ is dependent on both additive (including dominance in populations with
three genotypes at each locus) and epistatic effects at the two current scanning
positions, while $\mathrm{LOD}_{AA}$ is only dependent on the epistatic effects. The value of
$\mathrm{LOD}_{AA}$ is generally lower than the value of $\mathrm{LOD}_A$. Generally speaking, epistatic
QTLs are much more difficult to detect than additive and dominant QTLs.
To ensure the reliability and accuracy of the detected epistatic QTLs, larger
mapping populations and higher LOD score thresholds are required. QTL mapping
studies should focus on the additive and dominant QTLs first. If the detected
additive and dominant QTLs can explain most of the heritability of the trait of interest,
there may be no need to consider the epistatic QTL mapping. Otherwise, if the
heritability of the trait of interest is indeed high, and the detected additive and
dominant QTLs explain only part of the genetic variance, leaving a large amount
of genetic variance unexplained, the epistatic QTL mapping may be considered by
the two-dimensional scanning.

Similar to the additive (and dominant) QTL mapping, counting the detection
power of two interacting QTLs was restricted to a certain support interval. In this
chapter, the length of the support interval for power analysis of epistatic QTLs was
20 cM. In one simulated population, a pair of epistatic QTLs was considered to be
correctly detected, i.e., a true positive, if the two QTL positions corresponding to the
significant peak in the two-dimensional $\mathrm{LOD}_{AA}$ profile were within 10 cM to the left
and right of the two true positions, respectively. If there were multiple peaks within
the support interval, only the highest peak was counted. Difficulties occur in
counting the false positives of epistatic QTLs. In additive (and dominant) mapping
by one-dimensional scanning, all peaks above the LOD score threshold in chromosomal
regions outside the support intervals were considered to be false positives.
When performing the power analysis of epistatic QTLs, a large number of false
positives were sometimes observed in one simulated population when the same
methodology was applied. This is because there are no support intervals when
counting the false positives. When the LOD score surface from the two-dimensional
scanning has multiple significant peaks in neighboring regions of true QTL positions,
they are counted as multiple false positives, thus increasing the false positive
rate. In addition, the epistatic QTLs were counted as false positives whenever either
position of the two QTLs was outside the support interval. Several factors may
therefore lead to an overestimation of the false positives in the power analysis of
epistatic QTLs. How to effectively estimate the false discovery rate during the
epistatic QTL mapping needs to be further investigated in theory.
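The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the representation of a peak as a (position 1, position 2, LOD) tuple, and the omission of chromosome identity are all hypothetical simplifications; the 10 cM half-width and the LOD threshold of 5.0 are the values used in this chapter.

```python
from typing import List, Sequence, Tuple

# A peak on the two-dimensional LOD surface:
# (position of 1st locus in cM, position of 2nd locus in cM, LOD score).
Peak = Tuple[float, float, float]


def classify_peaks(peaks: Sequence[Peak],
                   true_pos: Tuple[float, float],
                   half_width: float = 10.0,
                   threshold: float = 5.0) -> Tuple[bool, int]:
    """Return (detected, n_false) for one simulated population.

    A significant peak is inside the support interval only if BOTH of its
    positions lie within half_width of the corresponding true positions.
    """
    significant = [p for p in peaks if p[2] >= threshold]
    inside = [p for p in significant
              if abs(p[0] - true_pos[0]) <= half_width
              and abs(p[1] - true_pos[1]) <= half_width]
    # Only the highest peak inside the interval is counted, so a population
    # contributes at most one true positive.
    detected = len(inside) > 0
    # Every significant peak with either position outside counts as false.
    n_false = len(significant) - len(inside)
    return detected, n_false


def detection_power(runs: List[Sequence[Peak]],
                    true_pos: Tuple[float, float],
                    **kwargs) -> float:
    """Percentage of simulated populations in which the pair was detected."""
    hits = sum(classify_peaks(peaks, true_pos, **kwargs)[0] for peaks in runs)
    return 100.0 * hits / len(runs)
```

With this rule, several significant peaks lying just outside the support interval of one true pair are each counted as a separate false positive, which is the inflation effect discussed in the text.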


6.4 Mapping of the QTL by Environment Interactions 


Gene (or QTL) by environment interaction (QEI) widely exists in genetics. Studies 
on QEI contribute to a better understanding of the genetic architecture of important 
quantitative traits and the genotype by environment interactions. Identification of 
environment-stable QTLs having consistent genetic effects across a wide range of 
environments is of great importance to plant breeding. Taking DH populations as an 
example in this section, we briefly introduce the extension of ICIM to QEI mapping 
on phenotypic observations from multi-environmental trials. The readers can also 
refer to Li et al. (2015) for more information. 


6.4.1 Mapping of the Additive QTL by Environment 
Interactions 


Assume a mapping population consists of n DH lines, which have been
evaluated in e environments for a quantitative trait of interest, and
genotyped with m + 1 ordered markers. From chapter 5, the following
linear regression model on marker variables can be built for the phenotypes in each
environment,

$$y_{ih} = \beta_{0h} + \sum_{j=1}^{m+1} \beta_{jh} x_{ij} + \varepsilon_{ih} \tag{6.36}$$

where $y_{ih}$ is the phenotypic value of the ith DH line in the hth environment,
i = 1, 2, ..., n, and h = 1, 2, ..., e; $\beta_{0h}$ is the overall mean of the linear model in the
hth environment; $x_{ij}$ is the indicator variable for the jth marker type in the ith DH
line, which is equal to 1 or -1 standing for each parental type; $\beta_{jh}$ is the partial
regression coefficient of phenotype on the jth marker in the hth environment; and $\varepsilon_{ih}$
is the residual random error in the hth environment that is assumed to be
independent and normally distributed with the mean of zero and variance of $\sigma_\varepsilon^2$.
Stepwise regression can therefore be conducted to select significant markers for
phenotypic values from each environment, similar to the additive mapping
introduced in §5.2 in chapter 5.
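As a rough illustration of the per-environment marker model, the sketch below fits all marker coefficients of each environment by ordinary least squares with NumPy. It is a simplification and not the stepwise procedure itself: no marker selection is performed, it assumes more lines than markers, and the function name `fit_marker_model` and the array layout are hypothetical.

```python
import numpy as np


def fit_marker_model(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Least-squares fit of the marker model, one column of y per environment.

    y : (n, e) phenotypes of n DH lines in e environments
    x : (n, m + 1) marker indicators coded 1 / -1
    Returns an (e, m + 2) array; row h holds the intercept followed by
    the m + 1 marker coefficients of environment h.
    """
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])   # intercept + markers
    # lstsq accepts a 2-D right-hand side, so every environment is fitted
    # against the same design matrix in one call.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    markers = rng.choice([-1.0, 1.0], size=(100, 5))
    # Hypothetical effects: only marker 2 matters, more strongly in env. 2.
    pheno = np.column_stack([
        50.0 + 1.0 * markers[:, 1] + rng.normal(0.0, 1.0, 100),
        52.0 + 2.0 * markers[:, 1] + rng.normal(0.0, 1.0, 100),
    ])
    print(np.round(fit_marker_model(pheno, markers), 2))
```

Because the marker indicators do not vary by environment, the design matrix is shared and only the coefficients differ between environments, which is what the subscript h on the coefficients expresses.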

Assume the current scanning position is located within the marker interval
consisting of the kth and (k + 1)th markers. The phenotypic values in each environment
are adjusted by the significant markers retained in the stepwise regression model, i.e.,

$$\Delta y_{ih} = y_{ih} - \sum_{j \neq k, k+1} \hat{\beta}_{jh} x_{ij} \tag{6.37}$$

Let $\mu_{1h}$ and $\mu_{2h}$ be the phenotypic means of two QTL genotypes QQ and qq,
respectively, in environment h. Similar to the QTL mapping in individual
environments, define the overall mean $\mu_h$ and the additive effect $a_h$ in each
environment, i.e.,

$$\mu_{1h} = \mu_h + a_h, \qquad \mu_{2h} = \mu_h - a_h \tag{6.38}$$

In addition, define the average additive effect $\bar{a}$ and the additive by environment
interaction effect $ae_h$ as follows,

$$\bar{a} = \frac{1}{e} \sum_{h=1}^{e} a_h, \qquad ae_h = a_h - \bar{a} \tag{6.39}$$
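As a small worked step, the definitions of the overall mean and additive effect in each environment can be inverted, and the interaction effects sum to zero by construction. The e environment-specific additive effects are therefore equivalent to one average additive effect $\bar{a}$ plus e - 1 free interaction effects $ae_h$:

```latex
\mu_h = \tfrac{1}{2}\left(\mu_{1h} + \mu_{2h}\right), \qquad
a_h = \tfrac{1}{2}\left(\mu_{1h} - \mu_{2h}\right), \qquad
\sum_{h=1}^{e} ae_h = \sum_{h=1}^{e} a_h - e\,\bar{a} = 0 .
```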


For those effects included in equation 6.38, a number of hypothesis tests can be 
made for different purposes. Of course, the first thing we would like to know is 
whether the QTL is present at the current scanning position; secondly, if the QTL 
is present at the current scanning position, whether the performance of this 
QTL is stable; and finally, if the performance of the QTL is not stable, which 
environments have the largest interactions, and how large are the interaction 
effects, etc. 

By intuition, if the additive effect $a_h$ (h = 1, 2, ..., e) is significantly different
from 0 for at least one environment, a QTL is considered to be present at the current
scanning position. Thus, the presence of QTL can be tested using the following two
hypotheses.

$H_0$: $\mu_{1h} = \mu_{2h}$ (or $a_h = 0$) for any environment h (h = 1, 2, ..., e)

$H_1$: $\mu_{1h} \neq \mu_{2h}$ (or $a_h \neq 0$) for at least one environment

Under hypothesis $H_1$, QTL genotypes QQ and qq follow different normal distributions
under environment h, denoted by $N(\mu_{1h}, \sigma_\varepsilon^2)$ and $N(\mu_{2h}, \sigma_\varepsilon^2)$, respectively.
Therefore, the log-likelihood function is,

$$\ln L_1 = \sum_{h=1}^{e} \sum_{k=1}^{4} \sum_{i \in S_k} \ln\left[\pi_{k1} f(\Delta y_{ih}; \mu_{1h}, \sigma_\varepsilon^2) + \pi_{k2} f(\Delta y_{ih}; \mu_{2h}, \sigma_\varepsilon^2)\right] \tag{6.40}$$

where $S_k$ denotes the set of DH lines belonging to the kth marker type group (k = 1,
2, 3, 4); $i \in S_k$ denotes that the marker type of the ith DH line belongs to the kth marker
type group; $\pi_{kl}$ is the probability of the lth (l = 1, 2) QTL genotype under the kth
marker type group, as given in table 5.2 in chapter 5; and $f(\cdot; \mu_{lh}, \sigma_\varepsilon^2)$ is the density
function of normal distribution $N(\mu_{lh}, \sigma_\varepsilon^2)$.

Clearly, null hypothesis $H_0$ is equivalent to the average additive effect $\bar{a} = 0$,
together with the additive by environment interaction effect $ae_h = 0$ for any
environment h. In this situation, the two QTL genotypes QQ and qq follow the same normal
distribution $N(\mu_h, \sigma_\varepsilon^2)$ in environment h, and the log-likelihood function is,

$$\ln L_0 = \sum_{h=1}^{e} \sum_{i=1}^{n} \ln f(\Delta y_{ih}; \mu_h, \sigma_\varepsilon^2) \tag{6.41}$$

Solving the likelihood equations of equations 6.40 and 6.41 gives the maximum
likelihood estimates of parameters under both hypotheses. The hypothesis testing
statistic (denoted by $\mathrm{LOD}_1$, i.e., equation 6.42) is then calculated and used to
test for the presence of QTL.

$$\mathrm{LOD}_1 = \max \log_{10} L_1 - \max \log_{10} L_0 \tag{6.42}$$
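To make the two log-likelihood functions more concrete, the following sketch evaluates them for given parameter values and converts a difference in natural log-likelihood into a LOD score. It is an illustration only: the function names and array layout are assumptions, the QTL genotype probabilities are supplied per line (the row of table 5.2 that matches the marker class of that line), and the maximization step by the EM algorithm is not shown.

```python
import numpy as np


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var), evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)


def loglik_h1(dy, pi, mu1, mu2, var):
    """Mixture log-likelihood under H1 for given parameter values.

    dy  : (n, e) adjusted phenotypes
    pi  : (n, 2) probabilities of QTL genotypes QQ and qq for each line
    mu1 : (e,) means of QQ per environment; mu2 : (e,) means of qq
    var : error variance, assumed common to all environments
    """
    dy, pi = np.asarray(dy, float), np.asarray(pi, float)
    f1 = np.exp(normal_logpdf(dy, np.asarray(mu1, float), var))
    f2 = np.exp(normal_logpdf(dy, np.asarray(mu2, float), var))
    mixture = pi[:, [0]] * f1 + pi[:, [1]] * f2
    return float(np.sum(np.log(mixture)))


def loglik_h0(dy, mu, var):
    """Log-likelihood under H0: one common mean per environment."""
    dy = np.asarray(dy, float)
    return float(np.sum(normal_logpdf(dy, np.asarray(mu, float), var)))


def lod(lnl_alternative, lnl_null):
    """Convert a difference in natural log-likelihood to base-10 (LOD)."""
    return (lnl_alternative - lnl_null) / np.log(10.0)
```

When the two genotype means are set equal in every environment, the mixture collapses to a single normal density and the two functions return the same value, so the LOD score is 0, as expected under the null hypothesis.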


In equation 6.39, the average additive effect $\bar{a}$ actually measures the environmental
stability of the QTL. If $\bar{a}$ is significantly different from 0, the QTL is considered
to be stable; otherwise, the QTL is considered to be unstable, and the genetic
effect of the QTL is dominated by the environmental interactions. To further test
the stability of QTL, another hypothesis $H_2$ is made, and the log-likelihood function
of this hypothesis is given in equation 6.43.

$H_2$: $H_0$ is not true, but $\bar{a} = 0$, and

$$\ln L_2 = \sum_{h=1}^{e} \sum_{k=1}^{4} \sum_{i \in S_k} \ln\left[\pi_{k1} f(\Delta y_{ih}; \mu_{1h}, \sigma_\varepsilon^2) + \pi_{k2} f(\Delta y_{ih}; \mu_{2h}, \sigma_\varepsilon^2)\right] - \lambda \bar{a} \tag{6.43}$$

where $\lambda$ is called the Lagrange multiplier of the restriction given in hypothesis $H_2$.
From equations 6.38 and 6.39, we have $\bar{a} = \frac{1}{2e} \sum_{h=1}^{e} (\mu_{1h} - \mu_{2h})$. Therefore,
restriction $\bar{a} = 0$ in hypothesis $H_2$ is equivalent to $\sum_{h=1}^{e} (\mu_{1h} - \mu_{2h}) = 0$. Using the
EM algorithm and conditional maximization, maximum likelihood estimates of
parameters and the maximum of the likelihood function can be acquired. Because
$H_2$ adds a constraint to $H_1$, its maximum cannot exceed that of $H_1$. The difference
between the maximum of equation 6.40 and the maximum of equation 6.43 gives the
statistic (denoted by $\mathrm{LOD}_2$) to test the significance of the main effect of the QTL
across the environments, i.e., $\bar{a}$.

$$\mathrm{LOD}_2 = \max \log_{10} L_1 - \max \log_{10} L_2 \tag{6.44}$$

The difference between the maximum log-likelihood functions from hypotheses $H_2$
and $H_0$ gives the statistic to test the significance of QTL by environment interaction
effects, denoted as $\mathrm{LOD}_3$ (equation 6.45).

$$\mathrm{LOD}_3 = \max \log_{10} L_2 - \max \log_{10} L_0 \tag{6.45}$$
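A short worked identity may help relate the three statistics. It assumes that $\mathrm{LOD}_2$ is read as the comparison of the unconstrained hypothesis $H_1$ against the constrained hypothesis $H_2$, so that it is non-negative. Under that reading, the maximized likelihood under $H_2$ cancels, and the total LOD score splits into a main-effect part and an interaction part:

```latex
\mathrm{LOD}_1
  = \left(\max \log_{10} L_1 - \max \log_{10} L_2\right)
  + \left(\max \log_{10} L_2 - \max \log_{10} L_0\right)
  = \mathrm{LOD}_2 + \mathrm{LOD}_3 ,
\qquad
\max L_0 \le \max L_2 \le \max L_1 .
```

The ordering of the three maxima follows from nesting: each hypothesis is obtained from the next by adding constraints, so neither part of the sum can be negative.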


Hypothesis $H_1$ does not have any constraints on parameters $\mu_{1h}$ and $\mu_{2h}$ (h = 1,
2, ..., e) to be estimated. Hypothesis $H_0$ assumes that the two QTL genotypes have
equal phenotypic mean in each environment, i.e., $\mu_{1h} = \mu_{2h}$ for any h. Thus, the
difference in the maximum likelihood function between hypotheses $H_1$ and $H_0$
measures the likelihood of the presence of QTL at the current scanning position
(equation 6.42). Hypothesis $H_2$ is equivalent to adding a constraint, i.e., $\bar{a} = 0$, to
hypothesis $H_1$. The difference between $H_1$ and $H_2$ is whether or not to have the
constraint (equation 6.44). When the constraint $\bar{a} = 0$ is true, the difference between
the maximum likelihood functions under the two hypotheses will be small, and the
calculated LOD score will be close to 0. If the constraint $\bar{a} = 0$ does not hold, the
calculated LOD score will be large. In the QTL mapping study, $\mathrm{LOD}_1$ obtained from
equation 6.42 can be used to test the presence of QTL, $\mathrm{LOD}_2$ obtained from
equation 6.44 to further test the significance of QTL main effect across the
environments, and $\mathrm{LOD}_3$ obtained using equation 6.45 to further test the significance of
the QTL by environment interaction effects.

Passing the significance test by statistic $\mathrm{LOD}_1$ does not mean that
the additive effect is significantly different from 0 in every environment. Likewise,
passing the significance test by statistic $\mathrm{LOD}_3$ does not mean that the additive by
environment interaction effect is significantly different from 0 in every environment.
In theory, the significance of additive effects and QTL by environment
interactions in specific environments can still be tested by constructing other
hypotheses, which are not covered in this book. In practice, the significance of the
QTL under a specific environment can be roughly determined by the magnitude of
the estimate of the average additive effect. The importance of the environmental
interactions can be roughly judged from the magnitude and direction of the estimated
interaction effect.


6.4.2 Mapping of the Epistatic QTL and Environment 
Interactions 


In this section, we still use the DH population as an example to briefly introduce the
ICIM mapping method for epistatic QTLs and environment interactions. Assume
that there are e environments of phenotypic data for a trait of interest, and
genotypic data for m + 1 ordered markers in one population consisting of
n DH lines. To fit both the additive and epistatic effects in the DH population, the
linear regression model for each environment is,

$$y_{ih} = \beta_{0h} + \sum_{j=1}^{m+1} \beta_{jh} x_{ij} + \sum_{j<k} \beta_{jkh} x_{ij} x_{ik} + \varepsilon_{ih} \tag{6.46}$$

where $y_{ih}$ is the phenotypic data for the ith DH line in the hth environment, i = 1, 2,
..., n, and h = 1, 2, ..., e; $\beta_{0h}$ is the constant in the hth environment; $x_{ij}$ is the marker
indicator for the jth marker in the ith DH line that does not vary by environment,
with the values 1 and -1 for the two parental marker types; $\beta_{jh}$ is the regression
coefficient of the jth marker variable in the hth environment; $\beta_{jkh}$ is the regression
coefficient of the interaction between the jth and kth markers in the hth
environment; $\varepsilon_{ih}$ is the random error effect following the normal distribution with
mean 0 and variance $\sigma_\varepsilon^2$, i.e., the error variance is considered to be homogeneous
across environments. During the two-dimensional scanning for epistatic QTLs, the
two marker intervals at the current scanning positions are denoted as (j, j + 1) and
(k, k + 1), satisfying j < k, and the phenotypic observations are adjusted by,

$$\Delta y_{ih} = y_{ih} - \sum_{r \neq j, j+1, k, k+1} \hat{\beta}_{rh} x_{ir} - \sum_{\substack{r \neq j, j+1 \\ s \neq k, k+1}} \hat{\beta}_{rsh} x_{ir} x_{is} \tag{6.47}$$

where $\hat{\beta}_{rh}$ and $\hat{\beta}_{rsh}$ are the estimates of corresponding parameters in the linear
model of equation 6.46.

Let $\mu_{1h}$, $\mu_{2h}$, $\mu_{3h}$, and $\mu_{4h}$ be the phenotypic means in environment h for the four
QTL genotypes at two scanning positions. Similar to the QEI mapping introduced
in the previous section, the overall mean $\mu_h$, additive effects $a_{1h}$ and $a_{2h}$ of the two
QTLs, and one additive by additive epistatic effect $aa_h$ between the two QTLs can be
defined in the following equations for each environment.

$$\mu_{1h} = \mu_h + a_{1h} + a_{2h} + aa_h, \qquad \mu_{2h} = \mu_h + a_{1h} - a_{2h} - aa_h,$$

$$\mu_{3h} = \mu_h - a_{1h} + a_{2h} - aa_h, \qquad \mu_{4h} = \mu_h - a_{1h} - a_{2h} + aa_h$$

At the first scanning position, the average additive effect $\bar{a}_1$ and additive by
environment interaction effect $ae_{1h}$ are defined by equation 6.48. At the second
scanning position, the average additive effect $\bar{a}_2$ and additive by environment
interaction effect $ae_{2h}$ are defined by equation 6.49. The average epistatic effect $\overline{aa}$
together with the epistasis by environment interaction effect $aae_h$ are defined by
equation 6.50.

$$\bar{a}_1 = \frac{1}{e} \sum_{h=1}^{e} a_{1h}, \qquad ae_{1h} = a_{1h} - \bar{a}_1 \tag{6.48}$$

$$\bar{a}_2 = \frac{1}{e} \sum_{h=1}^{e} a_{2h}, \qquad ae_{2h} = a_{2h} - \bar{a}_2 \tag{6.49}$$

$$\overline{aa} = \frac{1}{e} \sum_{h=1}^{e} aa_h, \qquad aae_h = aa_h - \overline{aa} \tag{6.50}$$
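The four genotypic means of each environment determine the overall mean, the two additive effects, and the additive by additive effect uniquely. The sketch below inverts the four defining equations; the function name `effects_from_means` and the dictionary keys are hypothetical, and the input means are made-up numbers used only to show the arithmetic.

```python
import numpy as np


def effects_from_means(mu):
    """Recover per-environment effects from the four genotypic means.

    mu : (e, 4) array; row h holds the means of the four QTL genotypes
         in environment h, in the order used in the text.
    Returns a dict of (e,) arrays: overall mean, a1, a2 and aa.
    """
    m1, m2, m3, m4 = np.asarray(mu, dtype=float).T
    return {
        "mean": (m1 + m2 + m3 + m4) / 4.0,
        "a1": (m1 + m2 - m3 - m4) / 4.0,   # genotypes 1, 2 versus 3, 4
        "a2": (m1 - m2 + m3 - m4) / 4.0,   # genotypes 1, 3 versus 2, 4
        "aa": (m1 - m2 - m3 + m4) / 4.0,   # interaction contrast
    }


if __name__ == "__main__":
    # Two hypothetical environments.
    means = [[11.75, 10.25, 9.25, 8.75],
             [13.50, 14.50, 9.50, 10.50]]
    eff = effects_from_means(means)
    for name, values in eff.items():
        average = values.mean()                 # average effect
        print(name, values, average, values - average)
```

Averaging each recovered effect over environments and subtracting the average then gives the average effects and the interaction effects with environments.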


QTL mapping of epistasis and environment interaction contains more parameters
and genetic effects, and more hypothesis tests as well. Only the test for the
presence or absence of epistatic QTLs at two scanning positions is introduced below.
When the epistatic QTLs are present, the tests for the significance of particular
effects as defined in equations 6.48-6.50 are not considered. In the actual mapping
studies, the significance of various effects can be roughly determined from the
magnitude and direction of their estimated values. Null and alternative hypotheses
to test the presence of epistatic QTLs are,

$H_0$: $\mu_{1h} = \mu_{2h} = \mu_{3h} = \mu_{4h}$ for any environment h (h = 1, 2, ..., e)

$H_1$: no constraints on the four genotypic means in any environment

Under hypothesis $H_1$, assuming the kth QTL genotype (k = 1, 2, 3, 4) follows a
normal distribution $N(\mu_{kh}, \sigma_\varepsilon^2)$ in environment h, the log-likelihood function is,

$$\ln L_1 = \sum_{h=1}^{e} \sum_{j=1}^{16} \sum_{i \in S_j} \ln\left[\sum_{k=1}^{4} \pi_{jk} f(\Delta y_{ih}; \mu_{kh}, \sigma_\varepsilon^2)\right] \tag{6.51}$$

where $S_j$ denotes the set of DH lines having the jth marker type (j = 1, ..., 16); $\pi_{jk}$
(k = 1, ..., 4) is the proportion of the kth QTL genotype in the jth set, which is given
in table 6.1; and $f(\cdot; \mu_{kh}, \sigma_\varepsilon^2)$ denotes the density function of normal distribution
$N(\mu_{kh}, \sigma_\varepsilon^2)$.

Equation 6.52 gives the log-likelihood function under the null hypothesis $H_0$.
Based on the likelihood functions in equations 6.51 and 6.52, the LRT statistic or
LOD score can be calculated and then used to test for the significance of total
genetic variation at both scanning positions.

$$\ln L_0 = \sum_{h=1}^{e} \sum_{i=1}^{n} \ln f(\Delta y_{ih}; \mu_h, \sigma_\varepsilon^2) \tag{6.52}$$


6.4.3 QTL and Environment Interactions in One
Actual RIL Population in Maize


Mapping methods introduced previously for additive by environment and epistasis 
by environment interactions are immediately applicable to other populations with 
two genotypes at each locus. For populations with three genotypes at each locus, 
more genetic effects are present, and more marker variables have to be considered in 
the linear regression model on phenotypic observations. QTL mapping of environ- 
mental interactions can be conducted by a functionality called MET in the QTL 
IciMapping software (Meng et al., 2015) for 20 bi-parental populations. In this 
section, the first RIL population in maize nested association mapping 
(NAM) (Buckler et al., 2009; McMullen et al., 2009) was used as an example to 
demonstrate the outcomes that QEI mapping can provide. 

Parents to make the F1 hybrid were inbred lines B73 (coded by 0) and B97
(coded by 2). At each marker locus in the RIL population, the indicator variable is
equal to -1 for the B73 marker type, and 1 for the B97 marker type. The F1 hybrid
was selfed for five successive generations to form the mapping population consisting
of 194 RILs. The linkage map was constructed from 237 SNP markers, covering the
10 chromosomes of maize. The missing rate of marker data was 8.0%. The heading
(or male flowering) date of the population was investigated in field trials under three
environmental conditions, denoted by E1, E2, and E3. In QEI mapping, the
probabilities of variables entering and leaving the linear regression model were set to 0.001
and 0.002, respectively, the scanning step was set to 1 cM, and the LOD threshold
was set to 2.5. Meanwhile, the three environments were treated as three traits, and
QTL mapping was conducted for each environment using the same mapping
parameters as used in QEI mapping.

LOD profiles from QTL by environment mapping and QTL mapping for the 
three individual environments are shown in figure 6.7. No QTLs were detected on 
chromosomes 5-10, and therefore only chromosomes 1-4 are included in the LOD 
score profiles. A total of nine peaks above the threshold of 2.5 were observed on the 
LOD score profile from QEI mapping (figure 6.7A). That is to say, nine QTLs were
identified to control the heading date trait in the RIL population, four on
chromosome 1 (denoted by qHD1-1 to qHD1-4), two on chromosome 2 (denoted by qHD2-1
and qHD2-2), two on chromosome 3 (denoted by qHD3-1 and qHD3-2), and one on
chromosome 4 (denoted by qHD4) (figure 6.7A). No peaks above the LOD score
threshold were observed on other chromosomes.

For the four QTLs on chromosome 1, there were peaks near the position of
qHD1-1 on the LOD score profiles from environments E1 and E2, but none of them
exceeded the threshold of 2.5 (figure 6.7B and C). There was one peak near the
position of qHD1-1 exceeding the threshold on the LOD score profile from
environment E3 (figure 6.7D). Therefore, when QTL mapping was conducted for each of
the three environments, qHD1-1 was only detected in environment E3. Similarly,
qHD1-2 was only detected in environment E3, qHD1-3 was not detected in any of the
three environments, and qHD1-4 was detected in each of the three environments
(figure 6.7B-D). For qHD2-1, qHD2-2, and qHD3-2, the LOD score profiles from the
three environments had no peaks above the threshold at nearby positions
(figure 6.7B-D), and therefore the single-environment QTL mapping did not detect
these QTLs. qHD3-1 was detected in environment E1 but not in environments E2
and E3; qHD4 was detected in environment E3 but not in environments E1 and E2.

Table 6.15 gives detailed information on the nine QTLs detected by the joint 
QTL mapping across the three environments. Varied additive effects were observed 
from different QTLs and different environments. However, each QTL has a consis- 
tent direction in additive effects across the three environments. In the RIL popu- 
lation, the marker type of parental line B73 was coded as 0, and the allele it carries 
at one locus is denoted by q. The marker type of parental line B97 was coded as 2, 
and the allele it carries at one locus is denoted by Q. qHD1-4 has the highest LOD 
score among the nine QTLs, with an average additive effect of —0.67. Thus, the 
heading date of genotype QQ would be reduced by 0.67 days from the mean; the 
heading date of genotype qq would be delayed by 0.67 days from the mean, as far as 
the qHD1-4 locus is concerned. Allele Q is present in parent line B97, therefore the 
allele carried by B97 at qHD1-4 would reduce the heading date by an average of 
0.67 days in the three environments. At qHD1-3, qHD2-1, and qHD4, the alleles 
carried by parent line B97 also reduce the heading date consistently under the three 
environments. For the other five QTLs, the average additive effect was positive 
(table 6.15), and the allele that reduces the heading date comes from parent B73. 
The mapping results suggest that new inbred lines with shorter or longer heading 
dates than both parents can be developed by combining suitable alleles at the 
detected QTLs.

FIG. 6.7 - LOD score profile from QTL by environment mapping and QTL mapping for the
three individual environments on heading date in one RIL population in maize. Notes:
(A) QTL by environment mapping across three environments; (B) QTL mapping for
environment E1; (C) QTL mapping for environment E2; (D) QTL mapping for environment E3.

TAB. 6.15 - Positions (cM) and effects of the nine QTLs detected from QEI mapping for
heading date investigated in three environments (i.e., E1-E3).

QTL      Pos.    LOD     Additive effect          Average   Additive by environment
name     (cM)    score                            effect    interaction
                         E1      E2      E3                 E1      E2      E3
qHD1-1    43     2.92    0.50    0.29    0.31     0.37      0.13   -0.07   -0.06
qHD1-2    85     8.64    0.52    0.52    0.89     0.64     -0.12   -0.13    0.25
qHD1-3   133     5.09   -0.36   -0.44   -0.69    -0.50      0.13    0.06   -0.20
qHD1-4   172     9.46   -0.76   -0.57   -0.67    -0.67     -0.09    0.09    0.00
qHD2-1    20.5   4.03   -0.51   -0.66   -0.12    -0.43     -0.08   -0.23    0.31
qHD2-2   110.5   2.91    0.21    0.52    0.35     0.36     -0.15    0.16   -0.01
qHD3-1    53     3.77    0.84    0.25    0.18     0.42      0.42   -0.18   -0.24
qHD3-2   110     4.25    0.37    0.58    0.38     0.44     -0.08    0.14   -0.07
qHD4      53     4.38   -0.28   -0.48   -0.60    -0.46      0.17   -0.03   -0.14

Note: Due to the truncation errors, the calculating results from values given in columns 4-6 and
equation 6.39 may not be exactly the same as those given in the last four columns in the table.
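The relation between the columns of table 6.15 can be checked with a few lines of Python. The sketch below applies equation 6.39 to the three additive effects reported for qHD1-2; the function name is hypothetical, and the inputs are the two-decimal values printed in the table rather than the unrounded estimates.

```python
def average_and_interaction(effects):
    """Average additive effect and interaction effects (equation 6.39)."""
    average = sum(effects) / len(effects)
    return average, [a - average for a in effects]


# Additive effects of qHD1-2 in E1, E2 and E3, as printed in table 6.15.
avg, ae = average_and_interaction([0.52, 0.52, 0.89])
print(round(avg, 2), [round(v, 2) for v in ae])
# prints: 0.64 [-0.12, -0.12, 0.25]
```

The recomputed interaction effect in E2 is -0.12, whereas the table reports -0.13; this is the kind of small discrepancy from rounded inputs that the table note refers to.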




For the nine QTLs given in table 6.15, there was no QTL with positive effects
under some environments but negative effects under others. Therefore, all of them
had stable genetic effects across the three environments. As can be seen from the last
three columns of table 6.15, most of the interaction effects were much smaller in
magnitude than the average additive effects, indicating that the interaction between
QTL and environment may not be the main genetic factor determining heading date in this
mapping population.

To further illustrate some advantages of the combined QTL mapping across
multi-environments, results from the single-environment QTL mapping are given in
table 6.16. Two QTLs were detected in environment E1, located on chromosomes 1
and 3. One QTL was detected in environment E2, located on chromosome 1. Five
QTLs were detected in environment E3, three located on chromosome 1 and two on
chromosome 4 (table 6.16). Fewer QTLs were detected in individual environments
than in the combined analysis. As can be seen in the last column of table 6.16, except
for the QTL located at 133 cM on chromosome 4 from environment E3, all QTLs detected
in individual environments were detected by the combined QTL mapping. However, four
of the nine QTLs detected by the combined QTL mapping (i.e., qHD1-3, qHD2-1,
qHD2-2, and qHD3-2) were not detected by the single-environment QTL mapping.

In addition, due to random errors, one QTL may be estimated at different 
positions in different environments. For example, qHD1-4 was located at three dif- 
ferent positions at 173, 187, and 169 cM in the three environments, respectively. The 
combined QTL mapping is based on more phenotypic data, and the estimate of the 
QTL position at 172 cM (table 6.15) may be much closer to the true position. 
In QTL mapping studies where phenotypic observations from multiple environments 
are available, the combined analysis across environments should be used as much as 
possible. 

In some theoretical studies, it is suggested to conduct the combined analysis on 
phenotypic observations first (Messmer et al., 2009; Malosetti et al., 2004). For 
example, the best linear unbiased prediction (BLUP) on genotypic values was 
acquired on the multi-environment phenotypic data, followed by QTL mapping on 
BLUP values. With such an approach, only QTLs that are stably expressed across 
environments may be detected, and the effects of QTL by environment interactions 
cannot be studied. In practice, a better choice may be the simultaneous use of 
combined mapping across environments and single-environment mapping, from 
which the novelty and value of the detected QTLs may be better evaluated 
(Yin et al., 2015, 2017). Genetic architecture of quantitative traits could be more 
complicated than additive, dominance, and di-genic epistasis. The environments 
where the traits are evaluated can be equally complicated. Though we see benefits to 
applying the QEI mapping for multi-environmental trials, we do not exclude the use 
of QTL mapping on each environment, and then summarize the mapping results 
across the environments. Neither do we exclude the use of the estimated breeding 
values in QTL mapping where the target is to locate the highly-adapted and stable 
genes. 

TAB. 6.16 – Positions and effects of QTLs detected by QTL mapping on each of the three 
environments (i.e., E1 to E3). 

Env.  Chr.  Pos. (cM)  Left marker  Right marker  LOD score  PVE (%)  Add.   QTL given in table 6.15 
E1    1     173        L00165       L01039        4.42        9.93    -0.81  qHD1-4 
E1    3      54        L00071       L00951        5.28       11.38     0.87  qHD3-1 
E2    1     187        L00742       L00222        2.69        8.17    -0.86  qHD1-4 
E3    1      31        L00828       L01110        3.87        6.93     0.68  qHD1-1 
E3    1      85        L00789c6     L00916        4.53        7.82     0.71  qHD1-2 
E3    1     169        L00165       L01039        6.58       11.75    -0.87  qHD1-4 
E3    4      48        L00441       L00042        2.72        4.54    -0.55  qHD4 
E3    4     133        L00988       L00841        2.60        4.93     0.56  undetected 


Exercises 


6.1 There are four near-isogenic lines in wheat with different alleles only at two loci 
(i.e., A and B). Loci A and B are located on two different chromosomes of wheat, 
and genotypes of the four lines are denoted by AABB, aaBB, AAbb, and aabb. Three 
replicated observations for a quality trait are shown in the following table. 


Replicate   Near-isogenic line 
            AABB   aaBB   AAbb   aabb 
1           232    224    219    150 
2           231    218    211    152 
3           242    200    209    151 


(1) Conduct ANOVA to test the significance of difference among the four 
near-isogenic lines on the quality trait. 

(2) Estimate the additive effects at locus A and locus B, and additive by additive 
epistatic effect between the two loci; conduct ANOVA to test the significance of 
the epistatic effect. 

(3) Use the fixed-effect linear model of ANOVA to estimate the additive variance at 
each locus, the epistatic variance between the two loci, and the proportions of 
the three genetic variance components to the total phenotypic variance. 


6.2 In one epistatic QTL mapping study on a DH population, the phenotypic means 
of QTL genotypes Q1Q1Q2Q2, Q1Q1q2q2, q1q1Q2Q2, and q1q1q2q2 are estimated at 
11.7, 6.9, 10.1, and 11.3, respectively, during the two-dimensional scanning. The 
phenotypic variance of the DH population is estimated at 12.5. 


(1) Estimate the additive effects at locus Q1 and locus Q2, and the additive by 
additive epistatic effect between the two loci. 

(2) Estimate the additive variance and epistatic variance at the two scanning 
positions. 

(3) Estimate the proportion of phenotypic variance explained by all genetic effects 
at the two scanning positions. 


6.3 There are genetic interactions between two unlinked loci A and B. Segregation 
ratio 9:7 occurs in one F2 population produced by crossing the two genotypes AABB 
and aabb as parents. Assume there is no segregation distortion, and values of the two 
phenotypic classes are equal to 1 and 0, respectively, i.e., 


Genotype   BB        Bb        bb 
AA         G11 = 1   G12 = 1   G13 = 0 
Aa         G21 = 1   G22 = 1   G23 = 0 
aa         G31 = 0   G32 = 0   G33 = 0 


(1) Calculate the additive and dominant effects at the two loci, and the four 
epistatic effects between the two loci. 

(2) Calculate the population mean and total genetic variance in the F2 population. 

(3) Calculate the marginal means and marginal genetic effects of the three 
genotypes at locus A, and the genetic variance generated by locus A. 

(4) Calculate the marginal means and marginal genetic effects of the three 
genotypes at locus B, and the genetic variance generated by locus B. 

(5) Calculate the epistatic deviations between loci A and B, and the epistatic 
variance in the F2 population. 


6.4 In one epistatic QTL mapping study in an F2 population, estimates of the 
phenotypic means of nine QTL genotypes at two scanning loci A and B are shown in the 
following table. The phenotypic variance of the F2 population is known to be 100. 


Genotype   BB         Bb         bb 
AA         G11 = 12   G12 = 23   G13 = 15 
Aa         G21 = 32   G22 = 29   G23 = 16 
aa         G31 = 21   G32 = 28   G33 = 17 


(1) Calculate the additive and dominant effects at the two scanning loci, and the 
four epistatic effects between the two scanning loci. 

(2) Calculate the total genetic variance at the two scanning loci. 

(3) Calculate the marginal means and marginal genetic effects of the three 
genotypes at locus A, and the genetic variance generated by locus A. 

(4) Calculate the marginal means and marginal genetic effects of the three 
genotypes at locus B, and the genetic variance generated by locus B. 

(5) Calculate the epistatic deviations between locus A and locus B, and the 
epistatic variance in the F2 population. 

(6) Calculate the proportions of phenotypic variance explained by additive, 
dominant and epistatic effects, respectively. 


6.5 Pick one DH or RIL population included in the QTL IciMapping software to 
conduct the epistatic QTL mapping using both IM-EPI and ICIM-EPI mapping 
methods implemented in the software. 


(1) Plot the 3D profiles of the LOD score and genetic effects from the 
two-dimensional scanning for both mapping methods. 

(2) Tabulate the relevant information of the epistatic QTLs detected by both 
mapping methods, including chromosomal locations, nearest markers on either 
side of each detected QTL, and genetic effects. 

(3) Compare the epistatic mapping results and identify the major differences 
between the two mapping methods. 


6.6 Pick one F2 population included in the QTL IciMapping software to conduct 
the epistatic QTL mapping using both IM-EPI and ICIM-EPI mapping methods 
implemented in the software. 


(1) Plot the 3D profiles of the LOD score and genetic effects from the 
two-dimensional scanning for both mapping methods. 

(2) Tabulate the relevant information of the epistatic QTLs detected by both 
mapping methods, including chromosomal locations, nearest markers on either 
side of each detected QTL, and genetic effects. 

(3) Compare the epistatic mapping results and identify the major differences 
between the two mapping methods. 


6.7 Given the genetic model and parameter settings in table 6.2, use the BIP 
simulation functionality implemented in the QTL IciMapping software to conduct the 
detection power analysis of epistatic QTLs in RIL populations and compare the 
power analysis results with the results from DH populations. 


6.8 For the mapping population in §6.4.3, phenotypic means of two QTL genotypes 
QQ and qq at a certain scanning position for flowering time are obtained from the 
joint analysis of three environments and given in the following table. 


Environment   QQ      qq 
E1            71.15   72.78 
E2            77.62   78.81 
E3            69.81   71.58 


(1) Calculate the additive effects of the QTL at the scanning position in the three 
environments. 

(2) Calculate the average additive effect of the QTL at the scanning position, and 
the additive by environment interaction effects. 


Chapter 7 


Genetic Analysis in Hybrid 
F1 of Two Heterozygous 
Parents and Double-Cross F1 
of Four Homozygous Parents 


Linkage analysis and QTL mapping methodology introduced in previous chapters 
are mainly focused on genetic populations derived from two homozygous parents. 
However, sexually and asexually propagated species normally suffer from inbreeding 
depression, or cannot be inbred at all. 
Therefore, pure lines of homozygous genotypes cannot be generated, and genetic 
studies cannot be conducted in the progeny populations from pure-line parents. 
This chapter will focus on genetic analysis in the hybrid F1 progeny from two parents 
having heterozygous genotypes (or heterozygous parents in short): one is used as 
female and the other one as male. Coincidentally, if two single-cross F1s from four 
pure-line parents are treated as the female and male parents, the cross between the 
two single-cross F1s generates a double-cross F1 population. At one single 
locus, population structures are exactly the same between the hybrid F1 from two 
heterozygous parents and the double-cross F1 from four pure-line parents. 

When considering two linked loci together, the linkage phase of the double 
heterozygotes in the two single crosses can be determined from the genotypes 
of the four homozygous parents. The linkage phases are therefore known 
in the two single crosses acting as the two parents of the double cross. However, the 
linkage phase in the heterozygous female and male parents is generally unknown, 
and needs to be determined by linkage analysis in their F1 progeny population. If 
the linkage phases of two heterozygous parents have been determined, the two 
parents can be treated as two F1 hybrids from four virtual homozygous parents, and 
the F1 hybrids from two heterozygous parents become completely equivalent to the 
double-cross F1 from four pure-line parents. For this reason, linkage analysis and 
QTL mapping methods will be considered together for both populations in this 
chapter. 

For linkage analysis and linkage map construction, readers 
can also refer to Zhang et al. (2015a); for the imputation of 
incomplete and missing marker information and the subsequent QTL mapping, refer to 
Zhang et al. (2015b); for the corresponding analysis software, refer to 
Zhang et al. (2015c). Heterozygous parents in this chapter may be two individuals in 
a random mating population, two clonal cultivars, or two single crosses from four 
homozygous inbred lines. Therefore, genetic analysis methods introduced in this 
chapter are applicable to the full-sib families in random mating species, F1 
populations in asexually propagated species, double-cross F1 populations from four 
pure-line parents, and so on. Note that the F1 hybrids between two heterozygous 
parents were called the clonal F1 in Zhang et al. (2015a). 


7.1 Linkage Analysis in the Hybrid F1 Derived from Two 
Heterozygous Parents 


7.1.1 Categories of Polymorphism Markers 


Only diploid species are considered. Markers in the following sections can be 
understood as polymorphic loci without any obvious phenotypic effects, but can also 
be genetic loci having two or more alleles affecting the phenotypic trait of interest. 
At one single marker or gene locus, A and B are used to represent the two alleles in 
the female parent; C and D represent the two alleles in the male parent. Therefore, 
the genotypes of the two heterozygous parents are AB and CD, and the four 
genotypes in the F1 progeny are AC, AD, BC, and BD. Whether the four genotypes 
in the F1 progeny can be completely or partially distinguished depends on the 
polymorphism of the four alleles in the two parents and their progenies (figure 7.1). 
Based on the genotypes of two parents and genotypes of their F1 progenies, each 
locus can be classified into one of the following four categories. 

For Category I (or ABCD) markers, alleles A and B in the female parent can be 
separated, alleles C and D in the male parent can be separated, and two parental 
genotypes AB and CD can be separated. Four genotypes can be distinguished in the 
F1 progeny, represented by AC, AD, BC, and BD. When no distortion occurs, the 
four identifiable genotypes follow the Mendelian ratio of 1:1:1:1 (figure 7.1). 
Category I markers can also be called the fully informative markers, or the complete 
markers in short. It should be mentioned that it is not necessary that all four alleles 
are unique. In Category I, it can happen that one female allele is the same as one 
male allele. For example, when allele A is the same as allele C, two parental 
genotypes AB and CD (equal to AD) are still identifiable, and so are the four genotypes 
AA, AD, BA, and BD in the progeny. Such a marker is still classified into Category I. 

Category II (or A=B) represents the case of male polymorphism. That is to say, 
the markers do not show polymorphism in the female parent, but show polymorphism 
in the male parent. For Category II markers, the two alleles in the female parent 
cannot be separated. Therefore, genotypes AC and BC cannot be separated; neither 
can genotypes AD and BD. Only two genotypes can be observed in the F1 progeny, 
which are denoted as XC and XD (figure 7.1), where X stands for either allele A or 
allele B. In this category, XC contains genotypes AC and BC; XD contains 
genotypes AD and BD. When no distortion occurs, the two identifiable genotypes XC 
and XD follow the Mendelian ratio of 1:1. 

FIG. 7.1 – Four categories of polymorphism markers in two heterozygous parents and their F1 
progenies. Notes: In Category I or ABCD, both parents are polymorphic, and four genotypes 
in the progeny follow the Mendelian ratio of 1:1:1:1. In Category II or A=B, the female parent 
has no polymorphism, and two genotypes in the progeny follow the Mendelian ratio of 1:1. In 
Category III or C=D, the male parent has no polymorphism, and two genotypes in the 
progeny follow the Mendelian ratio of 1:1. In Category IV or AB=CD, both parents have 
polymorphism and show the same heterozygous genotype. Three genotypes in the progeny 
follow the Mendelian ratio of 1:2:1. 

Category III (or C = D) represents the case of female polymorphism. That is to 
say, the markers show polymorphism in the female parent, but do not show 
polymorphism in the male parent. For Category III markers, the two alleles in the male parent 
cannot be separated. Therefore, genotypes AC and AD cannot be separated; neither 
can genotypes BC and BD. Only two genotypes can be observed in the F1 progeny, 
which are denoted as AX and BX (figure 7.1), where X stands for either allele C or 
allele D. In this category, AX contains genotypes AC and AD; BX contains 
genotypes BC and BD. When no distortion occurs, the two identifiable genotypes AX 
and BX follow the Mendelian ratio of 1:1. 

Category IV (or AB = CD) represents the case of co-dominance. In this 
category, both parents have two identifiable alleles with exactly the same genotype. 
That is to say, the two alleles in the female parent are identical to the two alleles in 
the male parent. The two identical alleles are represented by A and B, and the same 
parental genotype is represented by AB. Three genotypes can be observed in the F1 
progeny, i.e., AA, AB, and BB. When no distortion occurs, the three identifiable 
genotypes follow the Mendelian ratio of 1:2:1 (figure 7.1). 

For markers belonging to Category I, the number of alleles can be viewed as four; 
each parent carries two different alleles. For markers belonging to Categories II and 
III, the number of alleles can be viewed as three. One parent carries two different 
alleles in the state of heterozygote, and the other parent carries one allele in the state 
of homozygote. The genetic constitution in the progeny is similar to the backcross 
populations from two homozygous parents as introduced in chapter 1. For markers 
belonging to Category IV, the number of alleles can be viewed as two, and the two 
alleles are present in the state of heterozygote in both parents. The genetic 
constitution in the progeny is similar to the F2 population from two homozygous parents 
as introduced in chapter 1. 
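
The segregation ratios of the four categories can be checked by enumerating the four 
combinations of one female allele with one male allele. The following is a minimal 
sketch, not part of any software mentioned in this book. The function name 
`progeny_classes` is hypothetical, and alleles that cannot be separated are simply 
given the same label (here "X"), which is an assumption of the sketch. 

```python
from collections import Counter
from itertools import product


def progeny_classes(female, male):
    """Count the distinguishable F1 genotype classes of one cross.

    Each parent is a pair of allele labels. Alleles that cannot be separated
    are given the same label (an assumption of this sketch).
    """
    counts = Counter()
    for f_allele, m_allele in product(female, male):
        # Marker genotypes are unordered, so AB and BA fall in the same class.
        counts["".join(sorted((f_allele, m_allele)))] += 1
    return dict(counts)


print(progeny_classes(("A", "B"), ("C", "D")))  # Category I: four classes, 1:1:1:1
print(progeny_classes(("X", "X"), ("C", "D")))  # Category II: two classes, 1:1
print(progeny_classes(("A", "B"), ("X", "X")))  # Category III: two classes, 1:1
print(progeny_classes(("A", "B"), ("A", "B")))  # Category IV: three classes, 1:2:1
```

The case mentioned for Category I, where allele A is the same as allele C, corresponds 
to `progeny_classes(("A", "B"), ("A", "D"))`, which still returns four classes. 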


7.1.2 Unknown Linkage Phases in Heterozygous Parents 
and Genotypes in Their F1 Progenies at Two Loci 


Similar to bi-parental populations introduced in previous chapters, when using the 
F1 populations derived from heterozygous parents in genetic studies, the first step is 
still to screen the parents for polymorphic markers and genotype the 
progenies. In the previous section, four marker categories were defined by the 
identifiable genotypes in parents and their progenies. Two loci have to be considered 
jointly in linkage analysis. The two markers may belong to the same category or two 
different categories. In this section, the ideal situation will be considered first, 
with both markers belonging to Category I. Linkage analysis including incomplete 
markers will be handled in §7.2. 

Take the female parent as an example. Let A1 and B1 be the two alleles at one 
locus; A2 and B2 be the two alleles at the other linked locus. This does not tell 
anything about the linkage relationship of alleles at the two loci. In reality, A1 and A2 
can be linked on one homologous chromosome and B1 and B2 on the other 
homologous chromosome. But there is one other possibility, namely, A1 and B2 are linked 
on one homologous chromosome, and B1 and A2 are linked on the other homologous 
chromosome. The same problem occurs in the male parent. This is actually the 
problem of unknown linkage phases briefly mentioned at the beginning of this 
chapter. Without additional information on the origin of two heterozygous parents, 
whether the genotype of the female parent is A1A2/B1B2 or A1B2/B1A2 is unknown 
before genetic analysis, where “/” is used to separate two homologous chromosomes. 
Similarly, the genotype of the male parent can be either C1C2/D1D2 or C1D2/D1C2, 
which is to be determined by linkage analysis in the progeny population. Considering 
two possible phases in both parents, four linkage phases can be distinguished, which 
are represented by linkage phases I to IV, respectively (table 7.1). 

TAB. 7.1 – Four possible linkage phases in two heterozygous parents at two linked loci. 

Linkage phase   Female parent (genotype A1B1A2B2)   Male parent (genotype C1D1C2D2) 
Phase I         A1A2/B1B2                           C1C2/D1D2 
Phase II        A1A2/B1B2                           C1D2/D1C2 
Phase III       A1B2/B1A2                           C1C2/D1D2 
Phase IV        A1B2/B1A2                           C1D2/D1C2 


Under the assumption of linkage phase I, figure 7.2 shows the procedure of 
population development starting from two heterozygous parents. The two parents 
produce gametes first, and then the female and male gametes are randomly 
combined to generate the F1 progenies. Assume A1, B1, C1, and D1 are the four alleles at 
locus 1; A2, B2, C2, and D2 are the four alleles at locus 2. Two haploid types in the 
female parent are A1A2 and B1B2; two haploid types in the male parent are C1C2 and 
D1D2. Namely, the genotype of the female parent is A1A2/B1B2, and the genotype of 
the male parent is C1C2/D1D2. 

In figure 7.2, the female parent produces four gametes, i.e., A1A2, A1B2, B1A2, 
and B1B2, and their frequencies depend on the recombination frequency between locus 1 
and locus 2 in the female parent, i.e., $r_F$. The male parent produces four gametes, i.e., 
C1C2, C1D2, D1C2, and D1D2, and their frequencies depend on the recombination 
frequency between locus 1 and locus 2 in the male parent, i.e., $r_M$. The random 
combination between female and male gametes will generate 16 genotypes in the 
progeny, and their theoretical frequencies are functions of $r_F$ and $r_M$. Based on the 
theoretical frequencies of genotypes, $r_F$ and $r_M$ can be estimated from observed 
numbers of the 16 genotypes in the progeny. Assuming $r_F$ and $r_M$ are equal, one 
combined recombination frequency, i.e., $r$, can be estimated. In some species of 
animals, recombination frequency varies greatly between females and males. In these 
situations, $r_F$ and $r_M$ should be estimated separately, which can be used to construct 
the female and male linkage maps, respectively. 
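
The way the 16 genotype frequencies arise from the two sets of gamete frequencies can 
be sketched in a few lines of code. This is only an illustration under linkage phase I; 
the function names and the example values of the two recombination frequencies (0.1 and 
0.2) are assumptions made here, not values from the text. 

```python
from itertools import product


def gamete_freqs(parental, recombinant, r):
    """Frequencies of the four gametes produced by one double heterozygote."""
    freqs = {g: 0.5 * (1.0 - r) for g in parental}    # non-recombinant types
    freqs.update({g: 0.5 * r for g in recombinant})   # recombinant types
    return freqs


def f1_genotype_freqs(r_f, r_m):
    """Theoretical frequencies of the 16 F1 genotypes under linkage phase I."""
    female = gamete_freqs(("A1A2", "B1B2"), ("A1B2", "B1A2"), r_f)
    male = gamete_freqs(("C1C2", "D1D2"), ("C1D2", "D1C2"), r_m)
    # Random union of gametes: each frequency is a product of two gamete frequencies.
    return {f"{fg}/{mg}": fp * mp
            for (fg, fp), (mg, mp) in product(female.items(), male.items())}


freqs = f1_genotype_freqs(r_f=0.1, r_m=0.2)  # hypothetical values
print(len(freqs))            # number of genotypes
print(sum(freqs.values()))   # the frequencies add up to 1, up to rounding error
```

Setting `r_f` and `r_m` to 0.5 gives 1/16 for every genotype, which is the unlinked 
case. 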

Figure 7.2 shows one of the four linkage phases defined in table 7.1. The linkage 
phase of alleles at two loci in two heterozygous parents can be determined from the 
linkage analysis in their progenies, which is the major content of the next sections 
and §7.2. 


7.1.3 Estimation of the Recombination Frequency Between 
Two Fully-Informative Markers 


For the two parental genotypes, as shown in figure 7.2, there are four female 
gametes, where A1A2 and B1B2 are the parental (or non-recombinant) types each 
with frequency $\frac{1}{2}(1 - r_F)$, and A1B2 and B1A2 are the recombinant (or non-parental) 
types each with frequency $\frac{1}{2}r_F$. There are four male gametes, where C1C2 and D1D2 
are the parental types each with frequency $\frac{1}{2}(1 - r_M)$, and C1D2 and D1C2 are the 
recombinant types each with frequency $\frac{1}{2}r_M$. Four gametes are generated by each 
parent and their theoretical frequencies are exactly the same as those generated by 
the F1 hybrid of two homozygous parents. The random combination between the 
female and male gametes generates 16 genotypes in progenies, and their theoretical 
frequencies are given in table 7.2. 

FIG. 7.2 – Graphic representation of the development of one hybrid F1 population from two 
heterozygous parents. Notes: Two linked loci are considered, and both are completely 
informative, i.e., Category I. 

Let n be the sample size of the progeny population. Assuming there is no missing 
data, the observed sample sizes given in the last column of table 7.2 follow a 
multinomial distribution with a total sample size n and 16 random variables. Based 
on theoretical frequencies and the observed sample sizes given in table 7.2, the 
likelihood function (L) and logarithm likelihood function (InL) of the observed 
samples can be constructed and given in equation 7.1. 


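$$
\begin{aligned}
L &= \frac{n!}{n_1!\,n_2!\cdots n_{16}!}
\left[\tfrac{1}{4}(1-r_F)(1-r_M)\right]^{n_1+n_4+n_{13}+n_{16}}
\left[\tfrac{1}{4}(1-r_F)\,r_M\right]^{n_2+n_3+n_{14}+n_{15}} \\
&\quad\times
\left[\tfrac{1}{4}r_F(1-r_M)\right]^{n_5+n_8+n_9+n_{12}}
\left[\tfrac{1}{4}r_F\,r_M\right]^{n_6+n_7+n_{10}+n_{11}} \\
\ln L &= C + (n_{1:4}+n_{13:16})\ln(1-r_F) + n_{5:12}\ln r_F
+ (n_1+n_{4:5}+n_{8:9}+n_{12:13}+n_{16})\ln(1-r_M) \\
&\quad + (n_{2:3}+n_{6:7}+n_{10:11}+n_{14:15})\ln r_M
\end{aligned}
\tag{7.1}
$$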


where $C$ is a constant independent of the unknown recombination 
frequencies, $n_{1:4}$ is the summation of $n_1$, $n_2$, $n_3$, and $n_4$, and other symbols such as 
$n_{13:16}$ and $n_{5:12}$ have a similar meaning. Maximum likelihood estimates (MLE) of 
recombination frequencies in female and male parents can be calculated by solving 
the likelihood equations, i.e., $\partial \ln L / \partial r_F = 0$ and $\partial \ln L / \partial r_M = 0$, which are given in 
equations 7.2 and 7.3, respectively. 

TAB. 7.2 – Theoretical frequencies and observed sample sizes of the 16 identifiable genotypes 
at two fully-informative linked loci in the F1 population from two heterozygous parents. 

Number   Joint genotype   Locus 1   Locus 2   Frequency                        Sample size 
1        A1A2/C1C2        A1C1      A2C2      $\frac{1}{4}(1-r_F)(1-r_M)$      $n_1$ 
2        A1A2/C1D2        A1C1      A2D2      $\frac{1}{4}(1-r_F)r_M$          $n_2$ 
3        A1A2/D1C2        A1D1      A2C2      $\frac{1}{4}(1-r_F)r_M$          $n_3$ 
4        A1A2/D1D2        A1D1      A2D2      $\frac{1}{4}(1-r_F)(1-r_M)$      $n_4$ 
5        A1B2/C1C2        A1C1      B2C2      $\frac{1}{4}r_F(1-r_M)$          $n_5$ 
6        A1B2/C1D2        A1C1      B2D2      $\frac{1}{4}r_F r_M$             $n_6$ 
7        A1B2/D1C2        A1D1      B2C2      $\frac{1}{4}r_F r_M$             $n_7$ 
8        A1B2/D1D2        A1D1      B2D2      $\frac{1}{4}r_F(1-r_M)$          $n_8$ 
9        B1A2/C1C2        B1C1      A2C2      $\frac{1}{4}r_F(1-r_M)$          $n_9$ 
10       B1A2/C1D2        B1C1      A2D2      $\frac{1}{4}r_F r_M$             $n_{10}$ 
11       B1A2/D1C2        B1D1      A2C2      $\frac{1}{4}r_F r_M$             $n_{11}$ 
12       B1A2/D1D2        B1D1      A2D2      $\frac{1}{4}r_F(1-r_M)$          $n_{12}$ 
13       B1B2/C1C2        B1C1      B2C2      $\frac{1}{4}(1-r_F)(1-r_M)$      $n_{13}$ 
14       B1B2/C1D2        B1C1      B2D2      $\frac{1}{4}(1-r_F)r_M$          $n_{14}$ 
15       B1B2/D1C2        B1D1      B2C2      $\frac{1}{4}(1-r_F)r_M$          $n_{15}$ 
16       B1B2/D1D2        B1D1      B2D2      $\frac{1}{4}(1-r_F)(1-r_M)$      $n_{16}$ 

$$
\hat{r}_F = \frac{n_{5:12}}{n} \tag{7.2}
$$

$$
\hat{r}_M = \frac{n_{2:3}+n_{6:7}+n_{10:11}+n_{14:15}}{n} \tag{7.3}
$$
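
The step from equation 7.1 to equations 7.2 and 7.3 can be written out as follows. This 
is a short derivation sketch based only on the log-likelihood in equation 7.1. The 
shorthand $n_M = n_{2:3}+n_{6:7}+n_{10:11}+n_{14:15}$ is introduced here for 
convenience and is not a symbol used elsewhere in the text. Both results rely on the 
fact that the 16 sample sizes add up to $n$. 

```latex
\begin{aligned}
\frac{\partial \ln L}{\partial r_F}
  &= -\frac{n_{1:4}+n_{13:16}}{1-r_F} + \frac{n_{5:12}}{r_F} = 0
  &&\Longrightarrow\quad
  \hat{r}_F = \frac{n_{5:12}}{n_{1:4}+n_{5:12}+n_{13:16}} = \frac{n_{5:12}}{n}, \\
\frac{\partial \ln L}{\partial r_M}
  &= -\frac{n-n_M}{1-r_M} + \frac{n_M}{r_M} = 0
  &&\Longrightarrow\quad
  \hat{r}_M = \frac{n_M}{n},
\end{aligned}
\qquad\text{where } n - n_M = n_1+n_{4:5}+n_{8:9}+n_{12:13}+n_{16}.
```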


For linkage phase I, the numerator in equation 7.2 is the observed sample size of 
the female recombinant gametes; the numerator in equation 7.3 is the observed 
sample size of the male recombinant gametes. Obviously, the genetic meaning of 
female and male recombination frequencies estimated in the F1 progeny of 
heterozygous parents is exactly the same as that in populations derived from 
homozygous parents. 

Theoretical recombination frequency is between 0 and 0.5. But due to sampling 
errors, when two loci are very close, recombinant gametes may not be observed in 
the population, resulting in an estimate of 0. When two loci are far away in one 
chromosome or not linked, the estimate can be even greater than 0.5. But the 
estimated value should not deviate too far from 0.5. In other words, for two closely 
linked loci in linkage phase I, the estimates from equations 7.2 and 7.3 should be 
obviously lower than 0.5. For two loci not closely linked or unlinked, the two esti- 
mates should be around 0.5. 

Equations 7.2 and 7.3 are derived from linkage phase I. In the case of linkage 
phase II (table 7.1), the numerator of equation 7.2 is still the observed sample size of 
the female recombinant gametes, and the equation still gives the estimate of $r_F$. 
However, the numerator of equation 7.3 becomes the observed sample size of the male 
non-recombinant gametes, and the equation gives the estimate of $1 - r_M$. In the case 
of linkage phase III (table 7.1), equation 7.2 gives the estimate of $1 - r_F$ and 
equation 7.3 gives the estimate of $r_M$. In the case of linkage phase IV (table 7.1), 
equation 7.2 gives the estimate of $1 - r_F$, and equation 7.3 gives the estimate of $1 - r_M$. 
For convenience, no matter what the true linkage phase would be, only phase I is 
considered first, and equations 7.2 and 7.3 are used to estimate the two 
recombination frequencies. Then, the true linkage phases in two heterozygous parents 
are determined by comparing the estimated recombination frequencies with 0.5, 
i.e., equation 7.4. The combined recombination frequency $r$ can be estimated at 
the same time when the true linkage phase is identified, i.e., being equal to the 
average value of female and male recombination frequencies estimated at the 
identified linkage phases (equation 7.4). 


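$$
\hat{r} =
\begin{cases}
\frac{1}{2}(\hat{r}_F + \hat{r}_M) & \hat{r}_F < 0.5,\ \hat{r}_M < 0.5 \quad \text{(linkage phase I)} \\
\frac{1}{2}\hat{r}_F + \frac{1}{2}(1 - \hat{r}_M) & \hat{r}_F < 0.5,\ \hat{r}_M > 0.5 \quad \text{(linkage phase II)} \\
\frac{1}{2}(1 - \hat{r}_F) + \frac{1}{2}\hat{r}_M & \hat{r}_F > 0.5,\ \hat{r}_M < 0.5 \quad \text{(linkage phase III)} \\
1 - \frac{1}{2}(\hat{r}_F + \hat{r}_M) & \hat{r}_F > 0.5,\ \hat{r}_M > 0.5 \quad \text{(linkage phase IV)}
\end{cases}
\tag{7.4}
$$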
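
The decision rule in equation 7.4 can be sketched as a small function. This is an 
illustration only, not the implementation used in any software; the function name is 
hypothetical, and an estimate exactly equal to 0.5 is treated as "not above 0.5", 
which is an assumption because equation 7.4 does not cover that boundary case. 

```python
def combined_recombination(r_f_hat, r_m_hat):
    """Return (linkage phase, female r, male r, combined r) following equation 7.4.

    r_f_hat and r_m_hat are the phase-I estimates from equations 7.2 and 7.3.
    Assumption of this sketch: an estimate equal to 0.5 is not flipped.
    """
    female_flipped = r_f_hat > 0.5   # equation 7.2 actually estimated 1 - r_F
    male_flipped = r_m_hat > 0.5     # equation 7.3 actually estimated 1 - r_M
    phase = {
        (False, False): "I",
        (False, True): "II",
        (True, False): "III",
        (True, True): "IV",
    }[(female_flipped, male_flipped)]
    r_f = 1.0 - r_f_hat if female_flipped else r_f_hat
    r_m = 1.0 - r_m_hat if male_flipped else r_m_hat
    return phase, r_f, r_m, 0.5 * (r_f + r_m)


# Phase-I estimates of 0.900 and 0.855 are both above 0.5, so the phase is IV.
print(combined_recombination(0.900, 0.855))
```

With this rule the female, male, and combined estimates all fall between 0 and 0.5, 
whatever the true linkage phase is. 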


7.1.4 Haploid Type Rebuilding in the Heterozygous 
Parents 


In equation 7.4, $\hat{r}_F$ is calculated by equation 7.2; $\hat{r}_M$ is calculated by equation 7.3. 
The four scenarios in the estimation of combined recombination frequency in 
equation 7.4 actually represent the four linkage phases defined in table 7.1. In this 
way, the two estimates given for linkage phase I can be properly used to 
determine the true linkage phase in heterozygous parents, and in the meantime, the 
estimate of combined recombination frequency can be assured to be between 0 and 
0.5. For a number of linked markers, the estimate of combined recombination 
frequency, i.e., $\hat{r}$, and the ordering algorithms introduced in chapter 3 can be used to 
build the combined linkage map. There may be some difference between the female 
and male recombination frequencies, but generally speaking the order of genetic loci 
on one chromosome should be the same in both parents. For this reason, only the 
estimates of combined recombination frequencies are used in linkage map 
construction in the integrated software GACD (Zhang et al., 2015c). On the female and 
male maps, markers have exactly the same order as those on the combined map, but 
the map distance is calculated from $\hat{r}_F$ and $\hat{r}_M$, respectively. 

Take five linked Category I markers (denoted as M1 to M5) as examples to 
illustrate the estimation of recombination frequencies, determination of unknown 
linkage phase, and rebuilding of the parental haploid types. The population size is 
200, and the order of the five markers is the same as the assigned number. Observed 
sample sizes of the 16 marker types at two marker loci and their estimated 
recombination frequencies are given in table 7.3. The order of the 16 genotypes in table 7.3 
is the same as that in table 7.2. For example, for markers M2 and M3, the estimated 
female recombination frequency is equal to 0.150, much lower than 0.5; while the 
estimated male recombination frequency is equal to 0.890, much higher than 0.5. 
Such a situation corresponds to linkage phase II, based on the criterion given in 
equation 7.4. For markers M4 and M5, the estimated female and male recombination 
frequencies are both lower than 0.5, corresponding to linkage phase I, based on the 
criterion given in equation 7.4. 
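
Equations 7.2 and 7.3 amount to counting recombinant gametes, which can be sketched as 
follows. The function name and the index constants are hypothetical; the 16 counts are 
those listed for markers M1 and M2 in table 7.3, in the genotype order of table 7.2. 

```python
# Genotype numbers (1-based, as in table 7.2) that carry a recombinant gamete
# under linkage phase I.
FEMALE_RECOMBINANT = range(5, 13)                  # genotypes 5 to 12
MALE_RECOMBINANT = (2, 3, 6, 7, 10, 11, 14, 15)


def estimate_phase1(counts):
    """Phase-I estimates of r_F and r_M (equations 7.2 and 7.3).

    counts holds the sample sizes n1..n16 in the genotype order of table 7.2.
    """
    if len(counts) != 16:
        raise ValueError("expected the sample sizes of 16 genotypes")
    n = sum(counts)
    r_f = sum(counts[i - 1] for i in FEMALE_RECOMBINANT) / n
    r_m = sum(counts[i - 1] for i in MALE_RECOMBINANT) / n
    return r_f, r_m


# Sample sizes for markers M1 and M2 (first data column of table 7.3).
m1_m2 = [1, 1, 6, 2, 10, 38, 37, 3, 4, 41, 41, 6, 1, 3, 4, 2]
r_f, r_m = estimate_phase1(m1_m2)
print(f"{r_f:.3f} {r_m:.3f}")  # prints "0.900 0.855"
```

Both values are above 0.5, so by equation 7.4 this pair of markers is in linkage 
phase IV. 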

Using the estimates of combined recombination frequencies given in the last row 
in table 7.3, the order of the five markers can be identified to be M1-M2-M3-M4-M5. 
Based on the constructed linkage map, the four haploid types in the two 
heterozygous parents can be rebuilt. Two haploid types in the female parent are 
called HapA and HapB; those in the male parent are called HapC and HapD. Firstly, 
at the first ordered locus M1, alleles A to D are assigned to HapA to HapD, accordingly 
(table 7.4). Then for the second ordered locus M2, the estimates of $r_F$ and $r_M$ with 
the previous locus M1 are equal to 0.900 and 0.855 (table 7.3), both larger than 0.5, 
corresponding to linkage phase IV. At this locus, allele B is assigned to HapA; allele 
A is assigned to HapB; allele D is assigned to HapC; and allele C is assigned to HapD 
(table 7.4). For the third ordered locus M3, the estimates of $r_F$ and $r_M$ with the 
previous locus M2 are equal to 0.150 and 0.890 (table 7.3), corresponding to linkage 
phase II. At this locus, allele B is assigned to HapA; allele A is assigned to HapB; 
allele C is assigned to HapC; and allele D is assigned to HapD (table 7.4). For the 
fourth ordered locus M4, the estimates of $r_F$ and $r_M$ with the previous locus M3 are 
equal to 0.090 and 0.800 (table 7.3), corresponding to linkage phase II. At this locus, 
allele B is assigned to HapA; allele A is assigned to HapB; allele D is assigned to 
HapC; allele C is assigned to HapD (table 7.4). For the fifth ordered locus M5, the 
estimates of $r_F$ and $r_M$ with the previous locus M4 are both lower than 0.5 (table 7.3), 
corresponding to linkage phase I. At this locus, allele B is assigned to HapA; allele 
A is assigned to HapB; allele D is assigned to HapC; allele C is assigned to HapD 
(table 7.4). Therefore, the distribution of four alleles on two homologous 
chromosomes in both parents can be determined, and the procedure to determine the 
linkage relationship of alleles is called haploid type rebuilding. 


TAB. 7.3 – Observed sample sizes of the 16 genotypes at two marker loci and their estimated recombination frequencies for five Category I and 
linked markers. 


Genotype      M1M2   M1M3   M1M4   M1M5   M2M3   M2M4   M2M5   M3M4   M3M5   M4M5 
AC_AC            1      5      3      5      4     29     26      7     10     41 
AC_AD            1      3      8     11     39      8     12     35     27      4 
AD_AC            6      4      8      8     36      9     11     38     34      3 
AD_AD            2      7      4      3      5     27     21      5      9     34 
AC_BC           10     29     13     14      0     10     10      1      1      3 
AC_BD           38     13     26     20      9      5      4      4      9      0 
AD_BC           37      6     26     20      9      4      6      3      4      0 
AD_BD            3     31     10     17      0     10     12      3      2      7 
BC_AC            4     29     11     13      1      8      8      0      1      3 
BC_AD           41      8     21     17      5      2      2      3      8      0 
BD_AC           41      9     26     23      6      2      4      3      4      2 
BD_AD            6     31     11     14      0      7     10      1      1      7 
BC_BC            1     10      5      6      5     28     23     10     15     41 
BC_BD            3      2     12     13     41     14     19     34     23     10 
BD_BC            4      2     10     10     33     12     11     40     30      6 
BD_BD            2     11      6      6      7     25     21     13     22     39 
$\hat{r}_F$  0.900  0.780  0.720  0.690  0.150  0.240  0.280  0.090  0.150  0.110 
$\hat{r}_M$  0.855  0.235  0.685  0.610  0.890  0.280  0.345  0.800  0.695  0.125 
$\hat{r}$    0.123  0.228  0.298  0.350  0.130  0.260  0.313  0.145  0.228  0.118 


TAB. 7.4 – Haploid type rebuilding in both heterozygous parents at five Category I and linked 
markers. 


Marker locus   Linkage phase with previous marker   HapA   HapB   HapC   HapD 
M1             The first marker                     A      B      C      D 
M2             Linkage phase IV                     B      A      D      C 
M3             Linkage phase II                     B      A      C      D 
M4             Linkage phase II                     B      A      D      C 
M5             Linkage phase I                      B      A      D      C 


Based on the previous example of haploid type rebuilding, the general procedure
can be summarized as follows. Starting from the first marker on each chromosome,
the four alleles A-D at the first locus are assigned to the four haploid types in order
(i.e., M1 in table 7.4). The same principle applies to the remaining markers, following
their order on the linkage map. Haploid types at each subsequent locus depend on the
true linkage phase with the previous locus. If the linkage phase with the previous locus
is phase I, the four haploid types at the current locus take the same values as at the
previous marker (for example, M5 in table 7.4). If they are in linkage phase II, HapA and
HapB take the same values as at the previous marker, but HapC and HapD take
opposite values (for example, M3 and M4 in table 7.4). If they are in linkage phase III,
HapA and HapB take opposite values to the previous marker, but HapC and HapD take
the same values (no such case in table 7.4). If they are in linkage phase IV, HapA and
HapB take opposite values to the previous marker, and HapC and HapD take opposite
values as well (for example, M2 in table 7.4).
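
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the book's software: the function names `linkage_phase` and `rebuild_haploid_types` are hypothetical, and the phase is read from the two estimates in the way the worked example does (phase I when both estimates are below 0.5, phase II when only the male estimate exceeds 0.5, phase III when only the female estimate exceeds 0.5, phase IV when both exceed 0.5).

```python
# Illustrative sketch; function names are hypothetical, not from the book's software.

def linkage_phase(r_f, r_m):
    """Classify the linkage phase from the female and male estimates."""
    if r_f < 0.5:
        return "I" if r_m < 0.5 else "II"
    return "III" if r_m < 0.5 else "IV"


def rebuild_haploid_types(phases):
    """Return one (HapA, HapB, HapC, HapD) tuple per marker, in map order.

    phases[k] is the linkage phase of marker k+2 with marker k+1.
    """
    rows = [("A", "B", "C", "D")]      # first marker: alleles assigned in order
    for phase in phases:
        a, b, c, d = rows[-1]
        if phase in ("III", "IV"):     # female haploid types take opposite values
            a, b = b, a
        if phase in ("II", "IV"):      # male haploid types take opposite values
            c, d = d, c
        rows.append((a, b, c, d))
    return rows


# (r_F, r_M) estimates between neighbouring markers, read from table 7.3
estimates = [(0.900, 0.855), (0.150, 0.890), (0.090, 0.800), (0.110, 0.125)]
phases = [linkage_phase(rf, rm) for rf, rm in estimates]
haploid_types = rebuild_haploid_types(phases)
for k, row in enumerate(haploid_types, start=1):
    print(f"M{k}", *row)
```

Run on the five markers of table 7.3, the sketch gives phases IV, II, II, and I and the same allele assignment as table 7.4.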


7.2 Estimation of the Recombination Frequency 
for Incompletely Informative Markers 


In the previous section, markers were first classified into four categories, i.e.,
ABCD, A = B, C = D, and AB = CD. For two markers both belonging to Category I
(or ABCD), §7.1 also introduced in detail the theoretical frequencies of the 16
identifiable genotypes, the estimation of three recombination frequencies (i.e., female,
male, and combined), the determination of the unknown linkage phase, and the rebuilding
of the parental haploid types. Markers in Category ABCD provide the complete
information needed in the genetic analysis of progenies, and are called fully
informative, or complete, markers. For the other three categories, some genotypes
cannot be separated, so only two or three genotypes are identifiable in progenies;
such markers are called incompletely informative, or incomplete, markers.
In this section, the incomplete markers are considered in linkage analysis.
Due to the symmetry in estimation, the recombination frequency between locus 1
and locus 2 is the same as that between locus 2 and locus 1. A total of 10 scenarios
occur in a two-point linkage analysis. If one locus is Category II, and the other one is
Category III, the four genotypes at two linked loci cannot be separated at all in
progenies. In this scenario, none of the three recombination frequencies r_F, r_M, and
r can be estimated. Table 7.5 shows the other nine scenarios where at least one
recombination frequency can be estimated. Detailed information on Scenario 1, i.e.,
two complete markers, can be found in §7.1. For brevity, this section gives only the
theoretical frequencies of identifiable genotypes and the estimation formulas in the
scenarios that include incomplete markers, i.e., Scenarios 2-9 in table 7.5. Procedures
for the maximum likelihood estimation (MLE) of recombination frequencies
are skipped. For some scenarios, the MLE of recombination frequencies is similar to
equations 7.2 and 7.3. For other scenarios, the Newton algorithm introduced in
chapter 2 has to be adopted. The first and second derivatives of the log-likelihood
can be found in the supplementary files of Zhang et al. (2015a).


TAB. 7.5 — Nine scenarios in the estimation of recombination frequency between two linked 
loci. 


Scenario   Marker category                   Recombination frequency
           Locus 1         Locus 2           r_F    r_M    r
1          I (ABCD)        I (ABCD)          ✓      ✓      ✓
2          I (ABCD)        II (A = B)               ✓
3          I (ABCD)        III (C = D)       ✓
4          I (ABCD)        IV (AB = CD)      ✓½     ✓½     ✓
5          II (A = B)      II (A = B)               ✓
6          II (A = B)      IV (AB = CD)             ✓½
7          III (C = D)     III (C = D)       ✓
8          III (C = D)     IV (AB = CD)      ✓½
9          IV (AB = CD)    IV (AB = CD)                    ✓

Note: The symbol ✓ indicates that the corresponding recombination frequency, i.e.,
r_F, r_M, or r, can be estimated; the symbol ✓½ indicates that only half of the observed
samples can be used in estimating the recombination frequency.


7.2.1 Theoretical Frequencies of Identifiable Genotypes 
Between the Complete Marker and Other Three 
Categories of Markers 


If locus 1 is Category I, and locus 2 is Category II or III, locus 1 has four identifiable
genotypes, but locus 2 has only two identifiable genotypes. When both loci are
considered together, the eight identifiable genotypes are given in table 7.6, and each
of them can be treated as a mixture of two of the 16 genotypes given in table 7.2.
Taking the first identifiable genotype as an example, A1C1X2C2 represents the two
genotypes A1C1A2C2 and A1C1B2C2 at two complete markers, i.e., the first and fifth
genotypes in table 7.2, with frequencies (1/4)(1 - r_F)(1 - r_M) and (1/4)r_F(1 - r_M),
respectively. Therefore, the sum of the two genotypic frequencies is equal to
(1/4)(1 - r_M), which is the theoretical frequency of the identifiable genotype A1C1X2C2
in table 7.6.

Scenario 2 represents no polymorphism at locus 2 in the female parent, and
Scenario 3 represents no polymorphism at locus 2 in the male parent. In Scenario 2,
the theoretical frequencies do not contain the female recombination frequency, and
therefore r_F cannot be estimated (table 7.6). In Scenario 3, the theoretical frequencies
do not contain the male recombination frequency, and therefore r_M cannot be estimated
(table 7.6). Let n be the size of the progeny population, and n1 to n8 be the sample
sizes of the eight identifiable genotypes given in table 7.6. The male and combined
recombination frequencies in Scenario 2 can be estimated by equations 7.5 and 7.6,
respectively.


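\hat{r}_M = \frac{n_2 + n_3 + n_6 + n_7}{n}    (Scenario 2)    (7.5)

\hat{r} = \frac{1}{2} - \left| \frac{1}{2} - \hat{r}_M \right| =
\begin{cases}
\hat{r}_M,     & \hat{r}_M < 0.5 \ \text{(linkage phases I or III)} \\
1 - \hat{r}_M, & \hat{r}_M > 0.5 \ \text{(linkage phases II or IV)}
\end{cases}    (7.6)


The female and combined recombination frequencies in Scenario 3 can be estimated
by equations 7.7 and 7.8, respectively.


\hat{r}_F = \frac{n_2 + n_4 + n_5 + n_7}{n}    (Scenario 3)    (7.7)

\hat{r} = \frac{1}{2} - \left| \frac{1}{2} - \hat{r}_F \right| =
\begin{cases}
\hat{r}_F,     & \hat{r}_F < 0.5 \ \text{(linkage phases I or II)} \\
1 - \hat{r}_F, & \hat{r}_F > 0.5 \ \text{(linkage phases III or IV)}
\end{cases}    (7.8)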
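
Equations 7.5 to 7.8 amount to counting the recombinant classes and then folding the single-parent estimate around 0.5. The following Python sketch illustrates this under stated assumptions: the function names and the example counts are invented for illustration, and the genotype order is taken to be that of table 7.6.

```python
# Illustrative sketch; names and example counts are hypothetical.

def fold(r_single):
    """Combined estimate from one parental estimate: 1/2 - |1/2 - r|."""
    return 0.5 - abs(0.5 - r_single)


def estimate_scenario2(counts):
    """counts = [n1, ..., n8] in the order of table 7.6; returns (r_M, r)."""
    n = sum(counts)
    r_m = (counts[1] + counts[2] + counts[5] + counts[6]) / n   # n2 + n3 + n6 + n7
    return r_m, fold(r_m)


def estimate_scenario3(counts):
    """counts = [n1, ..., n8] in the order of table 7.6; returns (r_F, r)."""
    n = sum(counts)
    r_f = (counts[1] + counts[3] + counts[4] + counts[6]) / n   # n2 + n4 + n5 + n7
    return r_f, fold(r_f)


# Made-up counts for 200 progenies, not data from the book
r_m, r_from_male = estimate_scenario2([5, 45, 45, 5, 5, 45, 45, 5])
r_f, r_from_female = estimate_scenario3([45, 5, 45, 5, 5, 45, 5, 45])
print(round(r_m, 3), round(r_from_male, 3))
print(round(r_f, 3), round(r_from_female, 3))
```

In the first made-up data set the male estimate is above 0.5, so the combined estimate is its complement; in the second the female estimate is already below 0.5 and is used directly.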


If locus 1 is Category I, and locus 2 is Category IV, locus 1 has four genotypes,
but locus 2 has only three identifiable genotypes. When both loci are considered
together, the 12 identifiable genotypes are given in table 7.7, and their theoretical
frequencies are given in the fourth column of the table. Female and male recombination
frequencies are confounded in the frequencies of the second, fifth, eighth, and
eleventh identifiable genotypes, and these genotypes make up half of the progeny
population. Both female and male recombination frequencies can still be estimated
from the other eight identifiable genotypes. Let n1 to n12 be the sample sizes of the 12
identifiable genotypes in table 7.7. Female and male recombination frequencies in
Scenario 4 can be estimated by equations 7.9 and 7.10, respectively. Then the linkage
phase can be determined, and the combined recombination frequency can be
estimated by the same method as given in equation 7.4.
TAB. 7.6 — Theoretical frequencies of eight identifiable genotypes in scenarios 2 and 3.

Identifiable   Locus 1        Scenario 2 (table 7.5)               Scenario 3 (table 7.5)
genotype       (Category I)   Locus 2            Frequency         Locus 2            Frequency
                              (Category II,                        (Category III,
                              X2 = A2 or B2)                       X2 = C2 or D2)
1              A1C1           X2C2               (1/4)(1 - r_M)    A2X2               (1/4)(1 - r_F)
2              A1C1           X2D2               (1/4)r_M          B2X2               (1/4)r_F
3              A1D1           X2C2               (1/4)r_M          A2X2               (1/4)(1 - r_F)
4              A1D1           X2D2               (1/4)(1 - r_M)    B2X2               (1/4)r_F
5              B1C1           X2C2               (1/4)(1 - r_M)    A2X2               (1/4)r_F
6              B1C1           X2D2               (1/4)r_M          B2X2               (1/4)(1 - r_F)
7              B1D1           X2C2               (1/4)r_M          A2X2               (1/4)r_F
8              B1D1           X2D2               (1/4)(1 - r_M)    B2X2               (1/4)(1 - r_F)


\hat{r}_F = \frac{n_3 + n_6 + n_7 + n_{10}}{n_1 + n_3 + n_4 + n_6 + n_7 + n_9 + n_{10} + n_{12}}    (Scenario 4)    (7.9)

\hat{r}_M = \frac{n_3 + n_4 + n_9 + n_{10}}{n_1 + n_3 + n_4 + n_6 + n_7 + n_9 + n_{10} + n_{12}}    (Scenario 4)    (7.10)


To fully exploit the sampling information, the combined recombination frequency
may also be estimated from the theoretical frequencies given in the second
part of table 7.7, once the linkage phase is determined from the two estimates given
by equations 7.9 and 7.10. This option takes advantage of all observed
individuals in the sampled population, and therefore the estimation accuracy of the
combined recombination frequency may be improved.
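
As a check on equations 7.9 and 7.10, the Python sketch below builds the 12 theoretical frequencies of Scenario 4 and recovers r_F and r_M from the eight informative genotypes. It is an illustration only: the function names are hypothetical, and the frequencies are derived here by summing the two-complete-marker frequencies, assuming the genotype order A1C1, A1D1, B1C1, B1D1 at locus 1, each with A2A2, A2B2, B2B2 at locus 2.

```python
# Illustrative sketch; names are hypothetical and the frequencies are derived,
# not copied from the book's table.

def scenario4_frequencies(r_f, r_m):
    """Theoretical frequencies of the 12 identifiable genotypes in Scenario 4."""
    nf, nm = 1 - r_f, 1 - r_m
    return [
        nf * nm / 4, (nf * r_m + r_f * nm) / 4, r_f * r_m / 4,   # A1C1 with A2A2, A2B2, B2B2
        nf * r_m / 4, (nf * nm + r_f * r_m) / 4, r_f * nm / 4,   # A1D1
        r_f * nm / 4, (r_f * r_m + nf * nm) / 4, nf * r_m / 4,   # B1C1
        r_f * r_m / 4, (r_f * nm + nf * r_m) / 4, nf * nm / 4,   # B1D1
    ]


def estimate_scenario4(counts):
    """counts = [n1, ..., n12]; returns the estimates (r_F, r_M)."""
    n = [None] + list(counts)            # shift to 1-based indexing
    informative = n[1] + n[3] + n[4] + n[6] + n[7] + n[9] + n[10] + n[12]
    r_f = (n[3] + n[6] + n[7] + n[10]) / informative
    r_m = (n[3] + n[4] + n[9] + n[10]) / informative
    return r_f, r_m


# Expected counts for a hypothetical population of 2000 with r_F = 0.1, r_M = 0.8
expected = [2000 * p for p in scenario4_frequencies(0.1, 0.8)]
r_f_hat, r_m_hat = estimate_scenario4(expected)
print(round(r_f_hat, 3), round(r_m_hat, 3))
```

Genotypes 2, 5, 8, and 11 are skipped because r_F and r_M are confounded in them; together they account for half of the expected counts.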


7.2.2 Theoretical Frequencies of Identifiable Genotypes 
Between Two Markers Belonging to Category II, III, 
or IV 


If both loci are Category II or III, each locus has two identifiable genotypes, leading 
to four genotypes when both loci are considered together. Scenario 5 represents that 
both markers have no polymorphism in the female parent. It can be seen from 


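TAB. 7.7 — Theoretical frequencies of the 12 identifiable genotypes for Scenario 4.

Identifiable   Locus 1        Locus 2         Theoretical frequency
genotype       (Category I)   (Category IV)
1              A1C1           A2A2            (1/4)(1 - r_F)(1 - r_M)
2              A1C1           A2B2            (1/4)[(1 - r_F)r_M + r_F(1 - r_M)]
3              A1C1           B2B2            (1/4)r_F r_M
4              A1D1           A2A2            (1/4)(1 - r_F)r_M
5              A1D1           A2B2            (1/4)[(1 - r_F)(1 - r_M) + r_F r_M]
6              A1D1           B2B2            (1/4)r_F(1 - r_M)
7              B1C1           A2A2            (1/4)r_F(1 - r_M)
8              B1C1           A2B2            (1/4)[(1 - r_F)(1 - r_M) + r_F r_M]
9              B1C1           B2B2            (1/4)(1 - r_F)r_M
10             B1D1           A2A2            (1/4)r_F r_M
11             B1D1           A2B2            (1/4)[(1 - r_F)r_M + r_F(1 - r_M)]
12             B1D1           B2B2            (1/4)(1 - r_F)(1 - r_M)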


Theoretical frequency depending on the combined recombination 
table 7.8 that the female recombination frequency is not included in the theoretical
genotypic frequencies. Therefore, only the male recombination frequency can be
estimated in Scenario 5. Scenario 7 represents the case in which both markers have no
polymorphism in the male parent. The male recombination frequency is not included in
the theoretical genotypic frequencies. Therefore, only the female recombination
frequency can be estimated in Scenario 7.


TAB. 7.8 — Theoretical frequencies of the four identifiable genotypes for Scenarios 5 and 7.

Identifiable   Scenario 5 (table 7.5)                               Scenario 7 (table 7.5)
genotype       Locus 1          Locus 2          Frequency          Locus 1          Locus 2          Frequency
               (Category II,    (Category II,                       (Category III,   (Category III,
               X1 = A1 or B1)   X2 = A2 or B2)                      X1 = C1 or D1)   X2 = C2 or D2)
1              X1C1             X2C2             (1/2)(1 - r_M)     A1X1             A2X2             (1/2)(1 - r_F)
2              X1C1             X2D2             (1/2)r_M           A1X1             B2X2             (1/2)r_F
3              X1D1             X2C2             (1/2)r_M           B1X1             A2X2             (1/2)r_F
4              X1D1             X2D2             (1/2)(1 - r_M)     B1X1             B2X2             (1/2)(1 - r_F)


Let n1 to n4 be the sample sizes of the four identifiable genotypes in table 7.8. It is
not hard to find that the male recombination frequency in Scenario 5 and the female
recombination frequency in Scenario 7 can be estimated by equations 7.11 and 7.12,
respectively. The combined recombination frequency and linkage phase can be
calculated by equations 7.6 and 7.8 for Scenarios 5 and 7, respectively.


\hat{r}_M = \frac{n_2 + n_3}{n}    (Scenario 5)    (7.11)

\hat{r}_F = \frac{n_2 + n_3}{n}    (Scenario 7)    (7.12)


If locus 1 is Category II or III, and locus 2 is Category IV, locus 1 has two
identifiable genotypes, and locus 2 has three genotypes, leading to six genotypes
when both loci are considered together. It can be seen from table 7.9 that the theoretical
frequencies of the second and fifth genotypes are both equal to one quarter; these two
genotypes make up half of the population but provide no information for estimating the
recombination frequency. The male recombination frequency in Scenario 6 and the female
recombination frequency in Scenario 8 can be estimated from the other four genotypes.
In fact, the two scenarios are similar to Scenarios 2 and 3, respectively.
TAB. 7.9 — Theoretical frequencies of the six identifiable genotypes for Scenarios 6 and 8.

Identifiable   Locus 1                       Locus 2   Frequency
genotype       Scenario 6    Scenario 8                Scenario 6        Scenario 8
               (table 7.5)   (table 7.5)               (table 7.5)       (table 7.5)
1              X1C1          A1X1            A2A2      (1/4)(1 - r_M)    (1/4)(1 - r_F)
2              X1C1          A1X1            A2B2      1/4               1/4
3              X1C1          A1X1            B2B2      (1/4)r_M          (1/4)r_F
4              X1D1          B1X1            A2A2      (1/4)r_M          (1/4)r_F
5              X1D1          B1X1            A2B2      1/4               1/4
6              X1D1          B1X1            B2B2      (1/4)(1 - r_M)    (1/4)(1 - r_F)


Let n1 to n6 be the sample sizes of the six identifiable genotypes in table 7.9. It is
not hard to find that the male recombination frequency in Scenario 6 can be estimated
by equation 7.13. Then the linkage phase can be determined, and the combined
recombination frequency can be estimated by the same method as given in
equation 7.6. The female recombination frequency in Scenario 8 can be estimated by
equation 7.14. Then the linkage phase can be determined, and the combined
recombination frequency can be estimated by the same method as given in
equation 7.8.


\hat{r}_M = \frac{n_3 + n_4}{n_1 + n_3 + n_4 + n_6}    (Scenario 6)    (7.13)

\hat{r}_F = \frac{n_3 + n_4}{n_1 + n_3 + n_4 + n_6}    (Scenario 8)    (7.14)


7.2.3 Theoretical Frequencies of Identifiable Genotypes 
Between Two Category IV Markers 


If two marker loci are both Category IV, each locus has three identifiable genotypes,
leading to nine identifiable genotypes when both markers are considered together,
i.e., Scenario 9 in table 7.5. Since the two parents have exactly the same genotype, it
is impossible to distinguish the female and male recombination frequencies. Only the
combined recombination frequency can be estimated in Scenario 9 (table 7.5). In
fact, this scenario is not unusual in genetics. Suppose one F2 population is derived
from two inbred parents and then used in a genetic study. The F2 population is
screened by a set of codominant markers, but unfortunately, the genotypic data are
not available for the two inbred parents. The reason could be that the two parents
were not included in genotyping by mistake, or that the two inbred parents are not
available at all. For example, one F2 population is developed by selfing a
commercial maize F1 hybrid, and therefore the two parental lines may not be
available. Such an F2 population can be treated as a special case of the F1 hybrid
from two heterozygous parents, i.e., the commercial hybrid F1 is used as both the
female and male parents, and all codominant markers used in genotyping belong to
Category IV. Without genotyping the two inbred parents, the linkage phase in the
commercial hybrid F1 is unknown, and the genetic analysis methods introduced
in chapter 2 are not applicable to such an F2 population.

For two markers both of Category IV (i.e., Scenario 9 in table 7.5), genotypes of
the two heterozygous parents are denoted as A1B1A2B2, and theoretical frequencies
of the nine genotypes in progenies are given in table 7.10 for each of the four possible
linkage phases. At linkage phase I, female and male parents both have the same
genotype A1A2/B1B2. At linkage phases II and III, the two parents have genotypes
A1A2/B1B2 and A1B2/B1A2, respectively. Linkage phases II and III in Scenario 9
give identical theoretical genotypic frequencies and therefore are equivalent
(table 7.10). At linkage phase IV, female and male parents both have the same
genotype A1B2/B1A2. It can be seen from table 7.10 that the theoretical frequencies
at linkage phase IV can be obtained by replacing r with 1 - r, and 1 - r with r, in the
theoretical frequencies at linkage phase I, and vice versa.

When the true linkage phase is I, an estimate of r can be acquired from the
theoretical frequencies at linkage phase I (table 7.10), and an estimate of 1 - r can be
acquired from the theoretical frequencies at linkage phase IV (table 7.10). The true
linkage phase can therefore be determined by comparing \hat{r} estimated at linkage phase
I with 0.5. That is, the true phase is I if \hat{r} < 0.5; otherwise, the true phase is IV.
However, the determination is more complicated when the true linkage phase is II or III.

Take the true recombination frequency r = 0.1 as an example to explain how the
linkage phase can be determined in Scenario 9. For this purpose, theoretical genotypic
frequencies at the four possible linkage phases are given in the second part of
table 7.10 for r = 0.1. Let p_i(r) (i = 1, 2, ..., 9) represent the theoretical genotypic
frequency depending on r, and p_i(0.1) (i = 1, 2, ..., 9) represent the genotypic
frequency at r = 0.1. Assume the population size is equal to one, and the sample sizes
of the nine genotypes are equal to their theoretical frequencies. The log-likelihood
function is given in equation 7.15, ignoring the terms which are
dependent on sample sizes but independent of the recombination frequency to be
estimated. The ignored terms are similar to the constant C in equation 7.1.


\ln L(r) \propto \sum_{i=1}^{9} p_i(0.1) \ln p_i(r)    (7.15)
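
Equation 7.15 can be evaluated numerically to see where each likelihood profile peaks. The Python sketch below is an illustration under assumptions: the genotype frequencies are those of table 7.10, the function names are invented for this example, and a grid over r with step 0.001 stands in for drawing the profiles.

```python
import math

# Illustrative sketch; function names are hypothetical.

def scenario9_frequencies(phase, r):
    """Frequencies of the nine genotypes of table 7.10 at a given linkage phase."""
    s = r * (1 - r)
    q = 1 - 2 * s                          # equals 1 - 2r + 2r^2
    if phase == "II/III":
        return [s / 4, q / 4, s / 4, q / 4, s, q / 4, s / 4, q / 4, s / 4]
    if phase == "IV":
        r = 1 - r                          # phase IV swaps r and 1 - r
    a, c = (1 - r) ** 2 / 4, r ** 2 / 4
    return [a, s / 2, c, s / 2, q / 2, s / 2, c, s / 2, a]


def log_likelihood(weights, phase, r):
    """Right-hand side of equation 7.15 for the fitted phase."""
    probs = scenario9_frequencies(phase, r)
    return sum(w * math.log(p) for w, p in zip(weights, probs))


def best_r(true_phase, fitted_phase, r_true=0.1):
    """Grid value of r with the highest likelihood under the fitted phase."""
    weights = scenario9_frequencies(true_phase, r_true)
    grid = [k / 1000 for k in range(1, 1000)]
    return max(grid, key=lambda r: log_likelihood(weights, fitted_phase, r))


for true_phase in ("I", "IV", "II/III"):
    peaks = {fitted: best_r(true_phase, fitted) for fitted in ("I", "II/III", "IV")}
    print(true_phase, peaks)
```

When the fitted phase is II/III and the true phase is also II or III, the profile has two equal peaks at r and 1 - r, so the grid search may report either one.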


By changing the value of r in equation 7.15, profiles of the likelihood function
when the true recombination frequency is equal to 0.1 can be acquired at the four
possible linkage phases, as shown in figure 7.3. If the true linkage phase is I, it can be
seen from figure 7.3A that the highest likelihood is around 0.1 (i.e., r) only for
linkage phase I. The highest likelihood is around 0.9 (i.e., 1 - r) for linkage phase IV,


TAB. 7.10 — Theoretical frequencies of the nine identifiable genotypes in Scenario 9.

Theoretical frequency depending on the combined recombination frequency r

Identifiable   Locus 1   Locus 2   Phase I                 Phases II and III       Phase IV
genotype
1              A1A1      A2A2      (1/4)(1 - r)^2          (1/4)r(1 - r)           (1/4)r^2
2              A1A1      A2B2      (1/2)r(1 - r)           (1/4)(1 - 2r + 2r^2)    (1/2)r(1 - r)
3              A1A1      B2B2      (1/4)r^2                (1/4)r(1 - r)           (1/4)(1 - r)^2
4              A1B1      A2A2      (1/2)r(1 - r)           (1/4)(1 - 2r + 2r^2)    (1/2)r(1 - r)
5              A1B1      A2B2      (1/2)(1 - 2r + 2r^2)    r(1 - r)                (1/2)(1 - 2r + 2r^2)
6              A1B1      B2B2      (1/2)r(1 - r)           (1/4)(1 - 2r + 2r^2)    (1/2)r(1 - r)
7              B1B1      A2A2      (1/4)r^2                (1/4)r(1 - r)           (1/4)(1 - r)^2
8              B1B1      A2B2      (1/2)r(1 - r)           (1/4)(1 - 2r + 2r^2)    (1/2)r(1 - r)
9              B1B1      B2B2      (1/4)(1 - r)^2          (1/4)r(1 - r)           (1/4)r^2

Theoretical frequency when r = 0.1

Identifiable   Locus 1   Locus 2   Phase I   Phases II and III   Phase IV
genotype
1              A1A1      A2A2      0.2025    0.0225              0.0025
2              A1A1      A2B2      0.045     0.205               0.045
3              A1A1      B2B2      0.0025    0.0225              0.2025
4              A1B1      A2A2      0.045     0.205               0.045
5              A1B1      A2B2      0.41      0.09                0.41
6              A1B1      B2B2      0.045     0.205               0.045
7              B1B1      A2A2      0.0025    0.0225              0.2025
8              B1B1      A2B2      0.045     0.205               0.045
9              B1B1      B2B2      0.2025    0.0225              0.0025
FIG. 7.3 — Profiles of likelihood functions at the four possible linkage phases when the true
recombination frequency is equal to 0.1. Notes: (A) the true linkage phase is I; (B) the true
linkage phase is IV; (C) the true linkage phase is II or III.


and around 0.5 for linkage phases II and III (figure 7.3A). If the true linkage phase is
IV, the highest likelihood is around 0.1 (i.e., r) only for linkage phase IV. The
highest likelihood is around 0.9 (i.e., 1 - r) for linkage phase I, and around 0.5 for
linkage phases II and III (figure 7.3B). If the true linkage phase is II or III, there are
two highest likelihood values, around 0.1 (i.e., r) and 0.9 (i.e., 1 - r), for linkage
phases II and III. The highest likelihood is around 0.5 for linkage phases I and IV
(figure 7.3C). The above observations from figure 7.3 indicate that the unknown linkage
phase can still be determined by comparing the estimates from the four possible
linkage phases.

Let n_i (i = 1, 2, ..., 9) be the observed sample size of the ith genotype given in
table 7.10, so that n_i/n is its observed frequency. Replace p_i(0.1) in equation 7.15
with n_i to have the likelihood function when the population size is equal to
n (= n_1 + n_2 + ... + n_9). For linkage phases I and IV, some cubic terms of r are
included in the likelihood equation, and an iterative algorithm has to be adopted to
find the solution. For linkage phases II and III, a quadratic likelihood equation is
acquired with two real roots between 0 and 1, as given in equation 7.16 (see
exercise 7.10). In fact, of the two roots given in equation 7.16, one is lower than 0.5,
and the other is higher than 0.5, corresponding to the two maxima of the solid profile
observed in figure 7.3C.


\hat{r} = \frac{1}{2} \left( 1 \pm \sqrt{1 - \frac{2 (n_1 + n_3 + n_5 + n_7 + n_9)}{n}} \right)    (Scenario 9, linkage phases II and III)    (7.16)
In summary, the true linkage phase in Scenario 9 can be determined by
the following method. If the MLE of recombination frequency is much lower than 0.5
at linkage phase I, much higher than 0.5 at linkage phase IV, and around 0.5 at
linkage phases II and III, the true linkage phase is identified as phase I. If the
MLE of recombination frequency is much higher than 0.5 at linkage phase I, much
lower than 0.5 at linkage phase IV, and around 0.5 at linkage phases II and III, the
true linkage phase is identified as phase IV. If the MLE of recombination
frequency is around 0.5 at linkage phases I and IV, much lower than 0.5 at one
of linkage phases II and III, and much higher than 0.5 at the other, the true linkage
phase is identified as either phase II or III. In short, the phase with the lowest
estimate of recombination frequency among the four possible phases is identified as
the true linkage phase, and the lowest estimate is taken as the MLE of recombination
frequency, \hat{r}.
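
The decision rule can be sketched in Python as follows. This is an illustration, not the book's implementation: the function names and the example counts are hypothetical, a plain grid search is substituted for the iterative algorithm at phases I and IV, and the phase II/III roots are obtained by maximizing the likelihood built from the frequencies of table 7.10.

```python
import math

# Illustrative sketch; names are hypothetical, and a grid search replaces
# the iterative algorithm mentioned in the text.

def loglik_phase_I(counts, r):
    """Log-likelihood at linkage phase I for counts n1..n9 (table 7.10 order)."""
    s = r * (1 - r)
    a, c, q = (1 - r) ** 2 / 4, r ** 2 / 4, (1 - 2 * s) / 2
    probs = [a, s / 2, c, s / 2, q, s / 2, c, s / 2, a]
    return sum(n * math.log(p) for n, p in zip(counts, probs))


def roots_phase_II_III(counts):
    """Two roots of the quadratic likelihood equation at phases II and III."""
    n = sum(counts)
    odd = sum(counts[0::2])                # n1 + n3 + n5 + n7 + n9
    disc = 1 - 2 * odd / n
    if disc <= 0:                          # no real roots: likelihood peaks at 0.5
        return 0.5, 0.5
    root = math.sqrt(disc)
    return (1 - root) / 2, (1 + root) / 2


def determine_phase(counts):
    """Return (phase label, estimate of r) using the lowest-estimate rule."""
    grid = [k / 1000 for k in range(1, 1000)]
    r_phase_I = max(grid, key=lambda r: loglik_phase_I(counts, r))
    candidates = {
        "I": r_phase_I,
        "IV": 1 - r_phase_I,               # phase IV likelihood mirrors phase I
        "II or III": roots_phase_II_III(counts)[0],
    }
    phase = min(candidates, key=candidates.get)
    return phase, candidates[phase]


# Made-up counts for 2000 progenies, equal to expected counts at r = 0.1
print(determine_phase([405, 90, 5, 90, 820, 90, 5, 90, 405]))
print(determine_phase([45, 410, 45, 410, 180, 410, 45, 410, 45]))
```

Because phases II and III give identical genotypic frequencies, the sketch can only report them jointly.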

To keep consistent with other scenarios, estimates of r_F and r_M need to be given
once the linkage phase has been determined and the combined recombination
frequency has been estimated. When linkage phase I is assigned as the true phase,
estimates of r_F and r_M are both equal to the estimate of r at linkage phase I, i.e.,
equation 7.17. When linkage phase II is assigned as the true phase, estimates of r_F
and r_M are equal to the estimates of r and 1 - r, respectively, at linkage phase II,
i.e., equation 7.18. When linkage phase III is assigned as the true phase, estimates
of r_F and r_M are equal to the estimates of 1 - r and r, respectively, at linkage phase III,
i.e., equation 7.19. When linkage phase IV is assigned as the true phase, estimates
of r_F and r_M are both equal to the estimate of 1 - r at linkage phase IV, i.e., equation 7.20.


\hat{r}_F = \hat{r}, \quad \hat{r}_M = \hat{r}    (linkage phase I is true)    (7.17)

\hat{r}_F = \hat{r}, \quad \hat{r}_M = 1 - \hat{r}    (linkage phase II is true)    (7.18)

\hat{r}_F = 1 - \hat{r}, \quad \hat{r}_M = \hat{r}    (linkage phase III is true)    (7.19)

\hat{r}_F = 1 - \hat{r}, \quad \hat{r}_M = 1 - \hat{r}    (linkage phase IV is true)    (7.20)


In reality, a Category IV (i.e., AB=CD) locus may arise in two ways. Firstly,
allele A is the same as allele C, and allele B is the same as allele D. Secondly, allele
A is the same as allele D, and allele B is the same as allele C. The two ways are
denoted as Category A=CB=D and Category A=DB=C, respectively. If the true
linkage phase is I or IV between one Category IV marker and its previous one of any
category, the same parental genotype at the Category IV locus can only be caused
by allele A being the same as allele C, and allele B being the same as
allele D. The Category IV marker can then be further classified as Category A=CB=D. If
the true linkage phase is II or III between one Category IV marker and its
previous one of any category, the same parental genotype at the Category IV
locus can only be caused by allele A being the same as allele D, and allele
B being the same as allele C. The Category IV marker can then be further classified as
Category A=DB=C. Therefore, by linkage analysis, Category AB=CD can be
further divided into two categories, A=CB=D and A=DB=C, which are actually
Categories IV and V in the double cross F1 populations from four pure-line parents
to be introduced in §7.3.


7.2.4 Haploid Type Rebuilding at the Presence 
of All Categories of Markers 


By the order on one chromosome, the categories of 20 markers, estimates of three 
recombination frequencies between neighboring markers, and four haploid types 
rebuilt in two heterozygous parents are given in table 7.11. The size of the progeny 
population is 200, and the 20 markers are ordered by the combined recombination 
frequency. Each of the four categories has five markers that are randomly located on 
the chromosome. Estimation of the three recombination frequencies between any 
two categories of markers has been discussed in detail in §7.1.3, and §7.2.1—§7.2.3. 
The procedure for rebuilding the four parental haploid types in the presence of all 
categories of markers is summarized as follows, always taking table 7.11 as an 
example. 

For the female parent, markers in Category A = B have no polymorphism.
Symbol X is first assigned to the two haploid types of the female parent (i.e., HapA and
HapB) at these loci, such as M1, M2, M13, M15, and M20 in table 7.11.
Allele X will be imputed as either A or B by an imputation algorithm for
incomplete and missing markers, which will be introduced in §7.3.4. Considered next
are the first two neighboring markers where r_F can be estimated, i.e., M3 and M4 in
table 7.11. At the first locus (i.e., M3 in table 7.11), alleles A and B are always
assigned to HapA and HapB, respectively. At the second locus, alleles A and B are
also assigned to HapA and HapB, respectively, if the estimate of r_F is lower than 0.5
(i.e., M4 in table 7.11); alleles B and A are assigned to HapA and HapB, respectively,
if the estimate is higher than 0.5 (no such case in table 7.11). Considered then is the
next marker where r_F can be estimated. If the estimate of r_F is lower than 0.5, alleles
at HapA and HapB take the same values as at the previous locus where r_F can be
estimated, such as M7, M10, M11, and so on in table 7.11; if the estimate is higher
than 0.5, alleles at HapA and HapB take the opposite values to the previous locus where
r_F can be estimated, such as M5, M6, M8, and so on in table 7.11. Repeat the process
until the last ordered locus on the chromosome where r_F can be estimated. Finally,
the two haploid types are rebuilt for the female parent.

For the male parent, markers in Category C = D have no polymorphism. Symbol
X is first assigned to the two haploid types of the male parent (i.e., HapC and HapD) at
these loci, such as M4, M5, M7, M9, and M18 in table 7.11. Allele X will be imputed as
either C or D by an imputation algorithm for incomplete and missing markers, which
will be introduced in §7.3.4. Next considered are the first two markers where r_M can be
estimated, i.e., M1 and M2 in table 7.11. At the first locus (i.e., M1 in table 7.11), alleles
C and D are always assigned to HapC and HapD, respectively. At the second locus,
alleles C and D are also assigned to HapC and HapD, respectively, if the estimate of
r_M is lower than 0.5 (i.e., M2 in table 7.11); alleles D and C are assigned to HapC and
HapD, respectively, if the estimate is higher than 0.5 (no such case in table 7.11).
Considered then is the next marker where r_M can be estimated. If the estimate of r_M
is lower than 0.5, alleles at HapC and HapD take the same values as at the
previous locus where r_M can be estimated, such as M6, M8, M11, and so on in table 7.11;


TAB. 7.11 — Estimation of recombination frequency and the rebuilding of parental haploid types
for 20 markers belonging to different categories and linked on one chromosome.

Marker   Category   Recombination frequency               Female parent   Male parent     Updated
                    Combined   Female        Male         HapA   HapB     HapC   HapD     category
M1       A=B                                              X      X        C      D        A=B
M2       A=B        0.050      Inestimable   0.050        X      X        C      D        A=B
M3       AB=CD      0.040      Inestimable   0.960        A      B        D      C        A=DB=C
M4       C=D        0.081      0.081         Inestimable  A      B        X      X        C=D
M5       C=D        0.025      0.975         Inestimable  B      A        X      X        C=D
M6       AB=CD      0.046      0.955         0.126        A      B        D      C        A=DB=C
M7       C=D        0.034      0.034         Inestimable  A      B        X      X        C=D
M8       ABCD       0.040      0.960         0.091        B      A        D      C        ABCD
M9       C=D        0.030      0.970         Inestimable  A      B        X      X        C=D
M10      AB=CD      0.053      0.053         0.926        A      B        C      D        A=CB=D
M11      ABCD       0.049      0.042         0.053        A      B        C      D        ABCD
M12      AB=CD      0.038      0.957         0.989        B      A        D      C        A=CB=D
M13      A=B        0.043      Inestimable   0.957        X      X        C      D        A=B
M14      ABCD       0.040      0.903         0.960        A      B        D      C        ABCD
M15      A=B        0.080      Inestimable   0.080        X      X        D      C        A=B
M16      AB=CD      0.059      0.961         0.941        B      A        C      D        A=DB=C
M17      ABCD       0.067      0.951         0.098        A      B        C      D        ABCD
M18      C=D        0.040      0.960         Inestimable  B      A        X      X        C=D
M19      ABCD       0.070      0.930         0.125        A      B        C      D        ABCD
M20      A=B        0.075      Inestimable   0.925        X      X        D      C        A=B
if the estimate of r_M is higher than 0.5, alleles at HapC and HapD take the opposite
values to the previous locus where r_M can be estimated, such as M3, M10, M12, M13,
and so on in table 7.11. Repeat the process until the last ordered locus on the
chromosome where r_M can be estimated. Finally, the two haploid types are rebuilt for
the male parent.
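
The rebuilding procedure for one parent can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `rebuild_parent` and its arguments are hypothetical, and each estimate in the input lists is taken to belong to the marker on its row, measured against the previous marker at which that parent's recombination frequency could be estimated. The input values are the female and male columns of table 7.11.

```python
# Illustrative sketch; the function and argument names are hypothetical.

def rebuild_parent(categories, r_prev, uninformative, alleles):
    """Rebuild the two haploid types of one parent along a chromosome.

    categories    -- marker categories in map order
    r_prev        -- per marker, this parent's estimated recombination frequency
                     with the previous informative marker (None if inestimable)
    uninformative -- category with no polymorphism in this parent
    alleles       -- allele pair assigned at the first informative marker
    """
    rows, current = [], None
    for category, r in zip(categories, r_prev):
        if category == uninformative:
            rows.append(("X", "X"))            # to be imputed later
            continue
        if current is None:
            current = alleles                  # first informative marker
        elif r is not None and r > 0.5:
            current = (current[1], current[0])  # opposite values
        rows.append(current)
    return rows


categories = ["A=B", "A=B", "AB=CD", "C=D", "C=D", "AB=CD", "C=D", "ABCD", "C=D",
              "AB=CD", "ABCD", "AB=CD", "A=B", "ABCD", "A=B", "AB=CD", "ABCD",
              "C=D", "ABCD", "A=B"]
r_female = [None, None, None, 0.081, 0.975, 0.955, 0.034, 0.960, 0.970, 0.053,
            0.042, 0.957, None, 0.903, None, 0.961, 0.951, 0.960, 0.930, None]
r_male = [None, 0.050, 0.960, None, None, 0.126, None, 0.091, None, 0.926,
          0.053, 0.989, 0.957, 0.960, 0.080, 0.941, 0.098, None, 0.125, 0.925]

female = rebuild_parent(categories, r_female, "A=B", ("A", "B"))
male = rebuild_parent(categories, r_male, "C=D", ("C", "D"))
hap_a = "".join(row[0] for row in female)
hap_c = "".join(row[0] for row in male)
print(hap_a)
print(hap_c)
```

The two printed strings list the alleles on HapA and HapC from M1 to M20 and can be compared with the haploid type columns of table 7.11.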

In table 7.11, HapA and HapB indicate the linkage relationship between alleles
A and B at all polymorphic loci in the female parent; HapC and HapD indicate the
linkage relationship between alleles C and D in the male parent. For example, alleles
B at loci M5, M8, M12, M16, and M18, and alleles A at the other loci are linked on one
homologous chromosome in the female parent; alleles A at loci M5, M8, M12, M16,
and M18, and alleles B at the other loci are linked on the other homologous
chromosome in the female parent. If the four haploid types are regarded as the
homozygous genotypes of four pure-line parents, which may be virtual, the F1 hybrid
between two heterozygous parents is identical to the double cross F1 of four
homozygous parents.

While genotyping at each locus, A and B are randomly assigned to represent the two
polymorphisms in the female parent; C and D are randomly assigned to represent the
two polymorphisms in the male parent. When the four haploid types are rebuilt from
linkage analysis, HapA and HapB demonstrate the linkage relationship of alleles
A and B at all loci on one chromosome; HapC and HapD demonstrate the linkage
relationship of alleles C and D at all loci on the same chromosome. Due to the
random assignment of alleles before genetic analysis, it can be seen from table 7.11
that alleles A can be present on HapB at some loci, and alleles B can be present on
HapA as well; alleles C can be present on HapD at some loci, and alleles D can be
present on HapC as well. In fact, allele B in HapA can be replaced with allele A, and
at the same time allele A in HapB can be replaced with allele B. After the replacement,
alleles A at all loci will be linked in HapA, and alleles B at all loci will be
linked in HapB. Similarly, allele D in HapC can be replaced with allele C, and at the
same time allele C in HapD can be replaced with allele D. After the replacement,
alleles C at all loci will be linked in HapC, and alleles D at all loci will be linked in
HapD.

Figure 7.4 shows the three linkage maps built from the 20 markers belonging to four
different categories. The combined map has 20 markers with a length of 101.79 cM,
where the Haldane mapping function is used to convert the estimated recombination
frequency to map distance. The orders of markers in the female and male maps are the
same as in the combined map, but the map distance between neighboring
markers is calculated from the estimated female and male recombination frequencies,
respectively. Therefore, the distances between the same two markers may not be
exactly equal on the three maps, for example, the distance between M10 and M11.
The female map does not contain the Category II markers, and the male map does
not contain the Category III markers. M3 is located at the beginning, and M19 is
located at the end of the female map. The length of the female map is 81.90 cM,
shorter than the combined map. M1 is located at the beginning, and M20 is located at
the end of the male map. The length of the male map is 103.02 cM, similar to the
length of the combined map.
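
The conversion from recombination frequency to map distance can be illustrated with the Haldane mapping function, d = -50 ln(1 - 2r) in cM. The Python sketch below is a hedged example: the function names are hypothetical, and estimates above 0.5 are folded to 1 - r before conversion, which is an assumption about how the per-parent maps are built once the linkage phase is resolved.

```python
import math

# Illustrative sketch; function names are hypothetical.

def haldane_cM(r):
    """Haldane map distance in cM for a recombination frequency 0 <= r < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * r)


def map_positions(r_between_neighbours):
    """Cumulative positions in cM, starting at 0 for the first marker."""
    positions = [0.0]
    for r in r_between_neighbours:
        r = min(r, 1.0 - r)            # fold estimates above 0.5 (assumption)
        positions.append(positions[-1] + haldane_cM(r))
    return positions


# First two combined estimates of table 7.11 (M1-M2 and M2-M3)
positions = map_positions([0.050, 0.040])
print([round(p, 2) for p in positions])
```

With r = 0.050 the function gives about 5.27 cM, the position of M2 on the combined map. Positions further along the map can differ slightly from the figure because the tabulated estimates are rounded to three decimals.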
[Figure 7.4: three panels (Combined map, Female map, and Male map), each listing marker names with their positions in cM.]


FIG. 7.4 – The combined, female, and male linkage maps of 20 markers belonging to four
different categories (see table 7.11). Notes: The size of the progeny population is 200. The map
unit is cM. The Haldane mapping function is used to convert the estimated recombination
frequency to map distance.
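
The conversion mentioned in the notes to figure 7.4 can be sketched in a few lines of Python. The sketch uses the standard form of the Haldane mapping function, d = −50 ln(1 − 2r) in cM, which assumes no crossover interference. The function names `haldane_cm` and `haldane_r` and the three recombination estimates are illustrative assumptions; they are not taken from the figure or from any software discussed in this book.

```python
import math


def haldane_cm(r):
    """Convert a recombination frequency r (0 <= r < 0.5) to map distance in cM."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * r)


def haldane_r(d_cm):
    """Inverse conversion: map distance in cM back to a recombination frequency."""
    if d_cm < 0.0:
        raise ValueError("distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-d_cm / 50.0))


# Hypothetical estimates for three neighboring marker intervals.
r_estimates = [0.05, 0.04, 0.10]
distances = [haldane_cm(r) for r in r_estimates]
# The map length is the sum of the distances of neighboring intervals.
map_length = sum(distances)
```

Because the female, male, and combined recombination frequencies are estimated separately, applying the same conversion to each set of estimates generally gives three different distances for the same pair of markers.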


7.3 Linkage Analysis in Double Cross F1 Derived
from Four Pure-Line Parents


7.3.1 Marker Categories and Estimation of Recombination
Frequency in the Double Cross F1 Population


Considering one marker or gene locus, genotypes of the four pure-line parents are 
denoted as AA, BB, CC, and DD. In genetics and breeding, pure lines are generated 
by regular inbreeding systems such as selfing and sib-mating (Falconer and Mackay, 
1996), and therefore are often called inbred lines or inbreds, especially in hybrid 
breeding programs. The genotype of the hybrid F1 between parents A and B (called
single cross AB) is AB, and the genotype of the hybrid F1 between parents C and D
(called single cross CD) is CD. The double cross F1 population is generated by
crossing the two F1 hybrids, assuming single cross AB is used as the female parent
and single cross CD is used as the male parent. Whether the four genotypes AC, AD,
BC, and BD can be completely or partially distinguished depends on the polymorphism
of the four alleles in the two parents and their progenies. Five categories can be
differentiated by the number of identifiable alleles in the four original parents and the
number of identifiable genotypes in the double cross F1 progenies (figure 7.5). The
first three categories are the same as those in the F1 hybrid population derived from
two heterozygous parents (figure 7.1), which are still denoted as Categories I (or
ABCD), II (or A = B), and III (or C = D) (see figures 7.1 and 7.5).

When genotypes of the four inbred parents are known, Category IV in figure 7.1
can be further classified into two categories, which are denoted as Categories IV (or
A = C, B = D) and V (or A = D, B = C). For Category IV, there is no polymorphism
between parents A and C, and no polymorphism between parents B and D. For
Category V, there is no polymorphism between parents A and D, and no polymorphism
between parents B and C. For markers belonging to these two categories, single
crosses AB and CD have exactly the same heterozygous genotype, and the double
cross F1 progenies have three identifiable genotypes following the Mendelian ratio of
1:2:1 (figure 7.5). Categories IV and V cannot be distinguished if genotypic data of
the four inbred parents are not available, which is the case discussed in §7.1 and §7.2.


[Figure 7.5: schematic banding patterns of inbreds A, B, C, and D, single crosses AB and CD, and the double cross F1 progenies for the five marker categories. Identifiable progeny genotypes: AC, AD, BC, and BD for Category I (ABCD); XC and XD for Category II (A = B); AX and BX for Category III (C = D); AA, AB, and BB for Categories IV (A = C, B = D) and V (A = D, B = C).]


FIG. 7.5 – Five categories of polymorphism markers in four inbred parents and their double
cross F1 population. Notes: In Category I, four genotypes can be differentiated in the double
cross F1 progenies, following the Mendelian ratio of 1:1:1:1. In Category II, there is no
polymorphism between parents A and B, and the two identifiable genotypes follow the
Mendelian ratio of 1:1 in the double cross F1 progenies. In Category III, there is no
polymorphism between parents C and D, and the two identifiable genotypes follow the Mendelian
ratio of 1:1 in the double cross F1 progenies. In Categories IV and V, the two single crosses
have the same heterozygous genotype, and the three identifiable genotypes follow the
Mendelian ratio of 1:2:1 in the double cross F1 progenies.
Markers belonging to Category I provide complete information in linkage
analysis and are called fully informative markers, or complete markers for short.
Those belonging to the other four categories provide incomplete information in
linkage analysis and are called incompletely informative markers, or incomplete
markers for short.

When genotypes of the four inbred parents and their double cross F1 progenies are
both available, the four alleles A, B, C, and D at each locus come from parents A, B,
C, and D, respectively. In single cross AB, alleles A at all loci are linked in one
homologous chromosome, and alleles B at all loci are linked in the other homologous
chromosome. In single cross CD, alleles C at all loci are linked in one homologous
chromosome, and alleles D at all loci are linked in the other homologous chromosome.
Therefore, linkage phases are known in the two single crosses, which are equivalent to
linkage phase I introduced in §7.1 and §7.2. The other three linkage phases need to be
considered only when the genotypic data of the four pure-line parents are not available.

For two complete markers, the theoretical frequencies of the 16 genotypes are
exactly the same as those given in table 7.2. A total of 15 scenarios have to be
considered in recombination frequency estimation for the five categories of markers.
If one marker is Category II and the other is Category III, none of the three
recombination frequencies r_F, r_M, and r can be estimated. Table 7.12 shows the 14
scenarios where at least one of the three recombination frequencies can be estimated.


TAB. 7.12 – Fourteen scenarios in the estimation of recombination frequency between two
linked loci in the double cross F1 population.


Marker category                            Tables giving the theoretical           Recombination frequency
Locus 1              Locus 2               genotypic frequencies                   r_F    r_M    r

I (ABCD)             I (ABCD)              Table 7.2                               ✓      ✓      ✓
I (ABCD)             II (A = B)            Table 7.6, Scenario 2                   -      ✓      -
I (ABCD)             III (C = D)           Table 7.6, Scenario 3                   ✓      -      -
I (ABCD)             IV (A = C, B = D)     Table 7.7, linkage phase I              ½      ½      ✓
I (ABCD)             V (A = D, B = C)      Table 7.7, linkage phase II             ½      ½      ✓
II (A = B)           II (A = B)            Table 7.8, Scenario 5                   -      ✓      -
II (A = B)           IV (A = C, B = D)     Table 7.9, Scenario 6                   -      ½      -
II (A = B)           V (A = D, B = C)      Table 7.13                              -      ½      -
III (C = D)          III (C = D)           Table 7.8, Scenario 7                   ✓      -      -
III (C = D)          IV (A = C, B = D)     Table 7.9, Scenario 8                   ½      -      -
III (C = D)          V (A = D, B = C)      Table 7.13                              ½      -      -
IV (A = C, B = D)    IV (A = C, B = D)     Table 7.10, linkage phase I             -      -      ✓
IV (A = C, B = D)    V (A = D, B = C)      Table 7.10, linkage phases II and III   -      -      ✓
V (A = D, B = C)     V (A = D, B = C)      Table 7.10, linkage phase I             -      -      ✓


Note: The symbol ✓ indicates that the corresponding recombination frequency, i.e.,
r_F, r_M, or r, can be estimated; the symbol ½ indicates that only half of the observed
samples can be used in estimating the recombination frequency; a hyphen marks a
frequency that cannot be estimated.
In the double cross F1 population, where the linkage phases are always known, the
three recombination frequencies should be between 0 and 0.5 in theory. Estimates of
the three recombination frequencies should be much lower than 0.5 for two closely
linked loci, and around 0.5 for two loci that are far apart or independent. Generally
speaking, the estimated recombination frequencies should not be much larger than
0.5. An estimate much larger than 0.5 may indicate the presence of
coding errors. For example, an estimate of r_M = 0.9 for two complete markers
may be caused by a coding error at one locus, i.e.,
the allele in parent C was wrongly coded as D, and the allele in parent D was
wrongly coded as C.

Except for the scenarios between Categories II and V of markers, and between 
Categories III and V of markers (table 7.6), theoretical frequencies of identifiable 
genotypes can be found in different tables in §7.1 and §7.2 together with the esti- 
mates of three recombination frequencies, which will not be repeated in this section. 
If locus 1 is Category II or III, and locus 2 is Category V, locus 1 has two identifiable 
genotypes, and locus 2 has three identifiable genotypes. There are six identifiable 
genotypes when both loci are considered together in the linkage analysis. Theoretical 
frequencies of the six genotypes are given in table 7.13. Obviously, table 7.13
becomes identical to table 7.9 if 1 − r is replaced with r, and r with 1 − r, where r can
be either r_M or r_F.

Let n_1, n_2, ..., n_6 be the sample sizes of the six identifiable genotypes in table 7.13.
Theoretical frequencies of the second and fifth genotypes are both equal to one
quarter, taking half of the population but providing no information in estimating the
recombination frequency. The male recombination frequency between Categories II
and V of markers can be estimated from the sample sizes of the other four
genotypes, i.e., equation 7.21. The female recombination frequency between
Categories III and V of markers can be estimated from the sample sizes of the other four
genotypes, i.e., equation 7.22.


$$\hat{r}_M = \frac{n_1 + n_6}{n_1 + n_3 + n_4 + n_6} \quad \text{(between Categories II and V of markers)} \qquad (7.21)$$

$$\hat{r}_F = \frac{n_3 + n_4}{n_1 + n_3 + n_4 + n_6} \quad \text{(between Categories III and V of markers)} \qquad (7.22)$$


TAB. 7.13 – Theoretical frequencies of the six identifiable genotypes between Categories II
and V, and between Categories III and V of markers.


            Between Categories II and V                   Between Categories III and V
Genotype    Locus 1   Locus 2   Theoretical frequency     Locus 1   Locus 2   Theoretical frequency

1           X1C1      A2A2      ¼ r_M                     A1X1      A2A2      ¼ (1 − r_F)
2           X1C1      A2B2      ¼                         A1X1      A2B2      ¼
3           X1C1      B2B2      ¼ (1 − r_M)               A1X1      B2B2      ¼ r_F
4           X1D1      A2A2      ¼ (1 − r_M)               B1X1      A2A2      ¼ r_F
5           X1D1      A2B2      ¼                         B1X1      A2B2      ¼
6           X1D1      B2B2      ¼ r_M                     B1X1      B2B2      ¼ (1 − r_F)
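
As a small numerical illustration of equations 7.21 and 7.22, the sketch below computes the estimate from six genotype counts ordered as in table 7.13. It assumes that the recombinant classes are genotypes 1 and 6 between Categories II and V, and genotypes 3 and 4 between Categories III and V, as implied by the theoretical frequencies in table 7.13. The function name `estimate_r` and the counts are hypothetical.

```python
def estimate_r(counts, recombinant_classes):
    """Estimate a recombination frequency from the six genotype counts of table 7.13.

    counts: six sample sizes (n1, ..., n6) in the order of table 7.13.
    recombinant_classes: 1-based indices of the recombinant genotypes,
        (1, 6) between Categories II and V (equation 7.21) and
        (3, 4) between Categories III and V (equation 7.22).
    """
    if len(counts) != 6:
        raise ValueError("exactly six genotype counts are expected")
    informative = (1, 3, 4, 6)  # genotypes 2 and 5 carry no information
    denominator = sum(counts[i - 1] for i in informative)
    if denominator == 0:
        raise ValueError("no informative individuals")
    numerator = sum(counts[i - 1] for i in recombinant_classes)
    return numerator / denominator


# Hypothetical counts for 400 progenies, Categories II and V.
counts_ii_v = [10, 100, 90, 90, 100, 10]
r_m_hat = estimate_r(counts_ii_v, recombinant_classes=(1, 6))

# Hypothetical counts for 400 progenies, Categories III and V.
counts_iii_v = [90, 100, 10, 10, 100, 90]
r_f_hat = estimate_r(counts_iii_v, recombinant_classes=(3, 4))
```

In both hypothetical data sets only 200 of the 400 individuals enter the estimate, which is the situation marked by ½ in table 7.12.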


7.3.2 Equivalence Between the Double Cross F1
of Pure-Line Parents and Hybrid F1
of Heterozygous Parents


In the double cross F1 population where genotypes of the four pure-line parents are
known, alleles A, B, C, and D at each complete locus can be traced back to the four
parents A, B, C, and D, respectively. Considering two linked loci denoted by 1 and 2,
the genotype of the single cross AB is A1B1A2B2, and the linkage phase is
A1A2/B1B2; the genotype of the single cross CD is C1D1C2D2, and the linkage phase
is C1C2/D1D2. Linkage phases in the two single crosses AB and CD are known before
linkage analysis. When rebuilding the four haploid types in the two single crosses, HapA
will have alleles A at all loci; HapB will have alleles B at all loci; HapC will have
alleles C at all loci; and HapD will have alleles D at all loci. The four haploid types in
the two single crosses are equivalent to the haploid types of the four pure-line parents.

In the hybrid F1 population from heterozygous parents introduced in §7.1 and
§7.2, there is no guarantee that alleles A, B, C, and D are located on the four
respective haploid types ahead of the genetic analysis. As far as the two linked loci
are concerned, the genotype of the female parent can be either A1A2/B1B2 or
A1B2/B1A2; the genotype of the male parent can be either C1C2/D1D2 or
C1D2/D1C2, waiting to be determined. The unknown linkage phases complicate the
gene mapping methods in such populations. Fortunately, linkage phases in
heterozygous parents can be determined by linkage analysis, as has been shown in
§7.1 and §7.2, from which the four parental haploid types can be rebuilt. If the four
rebuilt haploid types are viewed as the haploid types of four inbred lines, the hybrid
F1 from two heterozygous parents becomes equivalent to the double cross F1
(figure 7.6). Obviously, in a double cross F1 population where the polymorphic loci
are only screened in the two single crosses and their F1 progenies, genotypes of the
four inbred parents are unknown, and alleles A, B, C, and D at each polymorphic
locus cannot be traced back to the four inbred parents. In this case, the double cross F1
has to be treated as the hybrid F1 from two heterozygous parents where the linkage
phases are unknown.

[Figure 7.6: schematic. Left: haploid types A and B in female parent AB and haploid types C and D in male parent CD give the hybrid F1 population from two heterozygous parents. Right: Inbred A × Inbred B and Inbred C × Inbred D give single crosses AB and CD, whose cross gives the double cross F1 population from four inbred parents.]


FIG. 7.6 – Schematic representation of the equivalence between the hybrid F1 from two
heterozygous parents and the double cross F1 from four inbred parents.


Cassava (Manihot esculenta Crantz) is a typical diploid clonal species, which can
be propagated both asexually by cutting branches and sexually by pollinated seeds.
In some situations, even self-pollination can be made to generate fertile seeds.
In practice, if selfing is possible, one segregating genetic population can be generated
by selfing one heterozygous clonal line, for example a clonal variety or genetic
material in cassava (figure 7.7). Such a population can be treated as a special case
of the hybrid F1 from two heterozygous parents, i.e., the female and male parents are
the same, and all markers belong to Category AB = CD. Linkage analysis methods
introduced in §7.1 and §7.2 are still applicable to such populations. When the two
haploid types HapA and HapB are rebuilt for the heterozygous parent, the population
becomes equivalent to one hybrid F2 from two homozygous parents. For one
F2 population from two inbred parents, if the genotyping is only conducted for the
hybrid F1 and the selfed F2 progenies, genotypic data are not available for the two
inbred parents. In this situation, the F2 population has to be treated as the special
hybrid F1 using the same heterozygote as both female and male parents, where the
linkage phase in the heterozygote is unknown before genetic analysis.

Crosses among three homozygous parents are also common in plant breeding. In the
first case, one single cross is first made between pure lines A and B, which is
then crossed with the third pure line C. The cross thus made is called a top cross or a
three-way cross, which is denoted as (A × B) × C or A/B//C. The top cross can be
treated as a special case of double cross, i.e., pure line C is exactly the same as pure line
D, and all markers belong to Category III. In the second case, both pure lines A
and C are crossed with pure line B, and then the two single crosses AB and CB are
crossed. The cross thus made is denoted as (A × B) × (C × B) or A/B//C/B. In such
a double cross population, alleles B at all loci are the same as alleles D. Therefore, the
segregating population generated either from (A × B) × C or (A × B) × (C × B)
can be treated as the double cross F1 population in genetic analysis.

In summary, the analysis methods introduced in this chapter have wide
applicability in genetic studies. They are applicable to the full-sib families using two
individuals in a random-mating population as parents, the hybrid F1 using two
clonal varieties (or genetic materials) as parents in a clonal species, and the double
cross F1 using four pure lines (or inbred lines) as parents. In addition, the methods
are also applicable to the selfed progenies from one heterozygous parent, the F2
population where the genotypes of the two pure-line parents are unknown, and
three-way and four-way crosses from three pure-line parents.
[Figure 7.7: schematic. Left: haploid types A and B in heterozygous parent AB, whose selfing gives the selfed progenies from one heterozygous parent. Right: Inbred A × Inbred B gives single cross AB, whose selfing gives the F2 progenies from two inbred parents.]


FIG. 7.7 – Schematic representation of the equivalence between the selfed progeny from one
heterozygous parent and the hybrid F2 from two inbred parents.


7.3.3 Genotypic Frequencies at Three Complete Markers


As has been shown in §7.1 and §7.2, from the genetic analysis in the progeny
population, linkage phases in the two heterozygous parents can be determined, and
haploid types in the female and male parents can be rebuilt. Therefore, the hybrid F1
population derived from two heterozygous parents where the linkage phases are
unknown becomes equivalent to one double cross F1 derived from four inbred parents
where the linkage phases are clear. From now on, only the double cross population
with known linkage phases is considered. In this section,
theoretical frequencies at three completely informative and linked markers are given
first, followed by some discussions on the imputation of incomplete and missing
marker information. To be consistent with the contents of QTL mapping to be
introduced in §7.4, the third locus is treated as one QTL between the two complete
markers. Theoretical frequencies of the 64 genotypes are given in table 7.14, where
r_1, r_2, and r are the combined recombination frequencies between the left marker
(i.e., locus 1) and locus q, between locus q and the right marker (i.e., locus 2), and
between locus 1 and locus 2, respectively. Genotypic frequencies when considering
locus 1 and locus 2 together can be found in table 7.2.

Considering three loci, i.e., 1, q and 2, at the same time, genotypes of the female
and male parents are denoted as A1AqA2/B1BqB2 and C1CqC2/D1DqD2. The theoretical
frequencies of the eight female gametes and eight male gametes are the same
as those given in figure 4.4 in §4.2. Random union between female and male
gametes produces a total of 64 genotypes in the progenies, and their theoretical
frequencies are given in table 7.14. For example, female gamete A1AqA2 and male
gamete C1CqC2 are generated by non-crossing between locus 1 and locus q, and
between locus q and locus 2, both of frequency ½(1 − r_1)(1 − r_2). They unite
randomly and produce the first genotype in the progenies as given in table 7.14, i.e.,
A1AqA2/C1CqC2 (or A1C1AqCqA2C2), whose theoretical frequency is the product of
the two gamete frequencies, being equal to ¼(1 − r_1)²(1 − r_2)². Frequencies of the other
genotypes can be found by the same method.
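
The product rule described above can be sketched in Python as follows. The sketch assumes that crossovers in the two intervals occur independently (no interference), which is what the gamete frequency ½(1 − r_1)(1 − r_2) implies, and that the same combined frequencies r_1 and r_2 apply to both parents. The function names and the numerical values 0.1 and 0.2 are illustrative only.

```python
from itertools import product


def gamete_frequencies(r1, r2, alleles):
    """Frequencies of the eight gametes of a parent with haploid types
    (a, a, a) / (b, b, b) at loci 1, q, and 2, assuming no crossover interference."""
    a, b = alleles
    freqs = {}
    for gamete in product((a, b), repeat=3):
        cross1 = gamete[0] != gamete[1]  # crossover between locus 1 and locus q
        cross2 = gamete[1] != gamete[2]  # crossover between locus q and locus 2
        freqs[gamete] = 0.5 * (r1 if cross1 else 1.0 - r1) * (r2 if cross2 else 1.0 - r2)
    return freqs


def genotype_frequencies(r1, r2):
    """Joint frequencies of the 64 progeny genotypes, keyed by (female gamete, male gamete)."""
    female = gamete_frequencies(r1, r2, ("A", "B"))
    male = gamete_frequencies(r1, r2, ("C", "D"))
    return {(f, m): pf * pm
            for (f, pf), (m, pm) in product(female.items(), male.items())}


freqs = genotype_frequencies(0.1, 0.2)
# Frequency of the first genotype, A1AqA2/C1CqC2.
first = freqs[(("A", "A", "A"), ("C", "C", "C"))]
```

Summing the returned frequencies over the four genotypes at locus q for fixed alleles at loci 1 and 2 gives the two-locus frequency, which is the normalizing constant used later when incomplete markers are imputed.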


TAB. 7.14 – Theoretical frequencies of the 64 genotypes at three complete loci in the double cross F1 population.


                      Locus q (located between locus 1 and locus 2)
Locus 1   Locus 2     AqCq                      AqDq                      BqCq                      BqDq

A1C1      A2C2        ¼(1−r_1)²(1−r_2)²         ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1²r_2²
A1C1      A2D2        ¼(1−r_1)²r_2(1−r_2)       ¼r_1(1−r_1)(1−r_2)²       ¼r_1(1−r_1)r_2²           ¼r_1²r_2(1−r_2)
A1D1      A2C2        ¼r_1(1−r_1)(1−r_2)²       ¼(1−r_1)²r_2(1−r_2)       ¼r_1²r_2(1−r_2)           ¼r_1(1−r_1)r_2²
A1D1      A2D2        ¼r_1(1−r_1)r_2(1−r_2)     ¼(1−r_1)²(1−r_2)²         ¼r_1²r_2²                 ¼r_1(1−r_1)r_2(1−r_2)
A1C1      B2C2        ¼(1−r_1)²r_2(1−r_2)       ¼r_1(1−r_1)r_2²           ¼r_1(1−r_1)(1−r_2)²       ¼r_1²r_2(1−r_2)
A1C1      B2D2        ¼(1−r_1)²r_2²             ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1²(1−r_2)²
A1D1      B2C2        ¼r_1(1−r_1)r_2(1−r_2)     ¼(1−r_1)²r_2²             ¼r_1²(1−r_2)²             ¼r_1(1−r_1)r_2(1−r_2)
A1D1      B2D2        ¼r_1(1−r_1)r_2²           ¼(1−r_1)²r_2(1−r_2)       ¼r_1²r_2(1−r_2)           ¼r_1(1−r_1)(1−r_2)²
B1C1      A2C2        ¼r_1(1−r_1)(1−r_2)²       ¼r_1²r_2(1−r_2)           ¼(1−r_1)²r_2(1−r_2)       ¼r_1(1−r_1)r_2²
B1C1      A2D2        ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1²(1−r_2)²             ¼(1−r_1)²r_2²             ¼r_1(1−r_1)r_2(1−r_2)
B1D1      A2C2        ¼r_1²(1−r_2)²             ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1(1−r_1)r_2(1−r_2)     ¼(1−r_1)²r_2²
B1D1      A2D2        ¼r_1²r_2(1−r_2)           ¼r_1(1−r_1)(1−r_2)²       ¼r_1(1−r_1)r_2²           ¼(1−r_1)²r_2(1−r_2)
B1C1      B2C2        ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1²r_2²                 ¼(1−r_1)²(1−r_2)²         ¼r_1(1−r_1)r_2(1−r_2)
B1C1      B2D2        ¼r_1(1−r_1)r_2²           ¼r_1²r_2(1−r_2)           ¼(1−r_1)²r_2(1−r_2)       ¼r_1(1−r_1)(1−r_2)²
B1D1      B2C2        ¼r_1²r_2(1−r_2)           ¼r_1(1−r_1)r_2²           ¼r_1(1−r_1)(1−r_2)²       ¼(1−r_1)²r_2(1−r_2)
B1D1      B2D2        ¼r_1²r_2²                 ¼r_1(1−r_1)r_2(1−r_2)     ¼r_1(1−r_1)r_2(1−r_2)     ¼(1−r_1)²(1−r_2)²


Note: The three loci are denoted as 1, q and 2. Locus q is between locus 1 and locus 2. r_1, r_2 and r are the combined recombination frequencies between locus 1 and
locus q, between locus q and locus 2, and between locus 1 and locus 2, respectively.
7.3.4 Imputation of Incomplete and Missing
Marker Information


Taking the integrated software GACD (Zhang et al., 2015c) as an example,
table 7.15 shows the genotypic codes for the five categories of markers in double
cross F1 populations. Category I markers allow the four genotypes to be separated in
the progenies and therefore provide complete information in genetic analysis. For
the other categories of markers, only two or three genotypes can be separated in the
progenies, and therefore part of the information is missing in genetic analysis. The
missing information is represented by X in genotypic codes. Completely missing
marker types are coded as XX for all categories.


TAB. 7.15 – Coding criterion of the identifiable genotypes in the GACD software together
with their Mendelian ratios for the five categories of markers in double cross F1 population.


Marker category       Coding of the identifiable   Mendelian   Coding of completely
                      genotypes                    ratio       missing marker type

I (ABCD)              AC, AD, BC, BD               1:1:1:1     XX
II (A = B)            XC, XD                       1:1         XX
III (C = D)           AX, BX                       1:1         XX
IV (A = C, B = D)     AA, AB, BB                   1:2:1       XX
V (A = D, B = C)      AA, AB, BB                   1:2:1       XX


Imputation of the incomplete and missing markers can avoid some unnecessary 
difficulties in the following gene mapping studies. The missing information is 
imputed by probability calculated from the linkage relationship. For markers 
belonging to Category II, genotype XC will be replaced by either AC or BC, and 
genotype XD will be replaced by either AD or BD. For markers belonging to Cate- 
gory III, genotype AX will be replaced by either AC or AD, and genotype BX will be 
replaced by either BC or BD. For markers belonging to Category IV, genotype AA 
will be replaced by AC, genotype AB will be replaced by either AD or BC, and 
genotype BB will be replaced by BD. For markers belonging to Category V, genotype 
AA will be replaced by AD, genotype AB will be replaced by either AC or BD, and 
genotype BB will be replaced by BC. The completely missing marker type XX will be 
replaced by either AC, AD, BC or BD. After the imputation, all markers belong to 
Category I with four fully informative genotypes AC, AD, BC and BD, which will 
greatly simplify the linkage mapping methodology of quantitative trait genes. 

The imputation of incomplete and missing markers is conducted following their order
on the linkage map. Assume all markers before the current one to be imputed belong
to Category I. The basic idea of imputation is to calculate the probabilities of the
possible genotypes, and then replace the incomplete marker types with complete
marker types according to these probabilities. Imputation is conducted separately for
each individual in the progeny population, and the imputation results may differ
between individuals even for the same incomplete or missing marker type. For example,
genotype XC may be imputed to AC in one individual, but imputed to BC in
another individual. Three situations occur in imputation, based on the number of
fully-informative markers which are linked with the current marker to be imputed.


1. No linkage information can be utilized 

The current marker belongs to this situation if, for example, it is not linked with any
other markers. As there is no linkage information that can be utilized,
probabilities of the four complete genotypes AC, AD, BC, and BD are equal for
missing genotype XX, each being 0.25 (equation 7.23). Take a random number rd from
the uniform distribution between 0 and 1. If rd < 0.25, XX will be replaced by AC; if
0.25 ≤ rd < 0.5, XX will be replaced by AD; if 0.5 ≤ rd < 0.75, XX will be replaced
by BC; otherwise, XX will be replaced by BD. In what follows, only the probabilities of
complete genotypes given the incomplete marker types are provided. The
imputation procedure is the same as that for XX and will not be repeated.


P{AC|XX} : P{AD|XX} : P{BC|XX} : P{BD|XX} = 1 : 1 : 1 : 1 (completely missing) (7.23)

If the current marker belongs to Category II, probabilities of AC and BC are 
both equal to 0.5 for incomplete genotype XC (equation 7.24); probabilities of AD 
and BD are both equal to 0.5 for incomplete genotype XD (equation 7.25). 


P{AC|XC} : P{BC|XC} = 1:1 (Category II markers) (7.24) 


P{AD|XD} : P{BD|XD} = 1:1(Category II markers) (7.25) 


If the current marker belongs to Category III, probabilities of AC and AD 
are both equal to 0.5 given the incomplete genotype AX (equation 7.26); proba- 
bilities of BC and BD are both equal to 0.5 given the incomplete genotype BX 
(equation 7.27). 


P{AC|AX} : P{AD|AX} = 1:1 (Category III markers) (7.26)


P{BC|BX} : P{BD|BX} = 1:1 (Category III markers) (7.27) 


If the current marker belongs to Category IV, incomplete genotype AA in the 
progenies can only be AC (equation 7.28) and will be replaced by AC. The proba- 
bilities of AD and BC are both equal to 0.5 given the incomplete genotype AB 
(equation 7.29). So genotype AB will be replaced by AD or BC by the ratio of 1:1. 
Incomplete genotype BB in the progenies can only be BD (equation 7.30), and will 
be replaced by BD. 

P{AC|AA} = 1 (Category IV markers) (7.28)


P{AD|AB} : P{BC|AB} = 1:1 (Category IV markers) (7.29)


P{BD|BB} = 1 (Category IV markers) (7.30)

If the current marker belongs to Category V, incomplete genotype AA in the
progenies can only be AD (equation 7.31) and will be replaced by AD. The probabilities
of AC and BD are both equal to 0.5 given the incomplete genotype AB
(equation 7.32). So genotype AB will be replaced by AC or BD by the ratio of 1:1.
Incomplete genotype BB can only be BC (equation 7.33), and will be replaced by BC.


P{AD|AA} = 1 (Category V markers) (7.31) 
P{AC|AB}: P{BD|AB} = 1:1 (Category V markers) (7.32) 
P{BC|BB} = 1 (Category V markers) (7.33) 
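
The first situation can be sketched in Python as follows. The table of candidate genotypes follows the replacement rules stated above, and the drawing step mirrors the comparison of the random number rd with cumulative probabilities. This is a minimal illustration under those assumptions, not the GACD implementation; the names `CANDIDATES`, `candidates`, `draw`, and `impute_unlinked` are hypothetical.

```python
import random

# Complete genotypes compatible with each incomplete code (Categories II to V).
CANDIDATES = {
    ("II", "XC"): ["AC", "BC"],
    ("II", "XD"): ["AD", "BD"],
    ("III", "AX"): ["AC", "AD"],
    ("III", "BX"): ["BC", "BD"],
    ("IV", "AA"): ["AC"],
    ("IV", "AB"): ["AD", "BC"],
    ("IV", "BB"): ["BD"],
    ("V", "AA"): ["AD"],
    ("V", "AB"): ["AC", "BD"],
    ("V", "BB"): ["BC"],
}
ALL_COMPLETE = ["AC", "AD", "BC", "BD"]


def candidates(category, code):
    """Complete genotypes that an observed code may stand for."""
    if code == "XX":  # completely missing, any category
        return list(ALL_COMPLETE)
    if category == "I":  # already fully informative
        return [code]
    return list(CANDIDATES[(category, code)])


def draw(genotypes, probabilities, rng):
    """Pick one genotype by comparing a uniform random number rd
    with the cumulative probabilities."""
    rd = rng.random()
    cumulative = 0.0
    for genotype, p in zip(genotypes, probabilities):
        cumulative += p
        if rd < cumulative:
            return genotype
    return genotypes[-1]  # guard against floating-point round-off


def impute_unlinked(category, code, rng):
    """Impute with equal probabilities when no linkage information is available."""
    options = candidates(category, code)
    return draw(options, [1.0 / len(options)] * len(options), rng)


rng = random.Random(2024)  # fixed seed only to make the sketch repeatable
imputed = [impute_unlinked("II", "XC", rng) for _ in range(5)]
```

Each call handles one individual at one marker, so repeated calls with the same code may return different complete genotypes, as described in the text.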


2. One fully-informative linked marker can be utilized 

The current marker belongs to this situation if, for example, it is located at either
end of the linkage map. Suppose locus 1 in table 7.2 is the fully-informative linked
marker, and locus 2 in table 7.2 is the current locus q to be imputed. Theoretical
frequencies of the 16 complete genotypes are represented by the combined recombination
frequency r and summarized in a two-way table, i.e., table 7.16. The 16
frequencies in terms of r given in table 7.16 can be acquired by replacing both r_F and r_M with
r in the theoretical frequencies given in table 7.2.

Obviously, the genotypic frequencies at locus q depend on the genotype at
locus 1. Take the genotype A1C1 at locus 1 as an example to show the probabilities of
complete genotypes, given the incomplete genotypes at the current locus q to be
imputed. Both genotypes AA and BB at Categories IV and V markers can only
be one complete genotype, same as equations 7.28, 7.30, 7.31, and 7.33, and will not
be given here again. Probability ratios as given in equations 7.34–7.40 are acquired
from the theoretical frequencies corresponding to genotype A1C1 at locus 1 as given
in table 7.16.


TAB. 7.16 – Theoretical frequencies of the 16 complete genotypes at two fully-informative
loci.


            Locus q (r is the combined recombination frequency with locus 1)
Locus 1     AqCq           AqDq           BqCq           BqDq

A1C1        ¼(1 − r)²      ¼r(1 − r)      ¼r(1 − r)      ¼r²
A1D1        ¼r(1 − r)      ¼(1 − r)²      ¼r²            ¼r(1 − r)
B1C1        ¼r(1 − r)      ¼r²            ¼(1 − r)²      ¼r(1 − r)
B1D1        ¼r²            ¼r(1 − r)      ¼r(1 − r)      ¼(1 − r)²


Notes: Locus q is incompletely informative and is to be imputed. Locus 1 is fully-informative and
linked with locus q with the combined recombination frequency r.
For example, when one individual has the incomplete genotype XX at locus q, and
the complete genotype AC at locus 1, frequencies of the four possible complete
genotypes at locus q are equal to ¼(1 − r)², ¼r(1 − r), ¼r(1 − r), and ¼r² (see the first
row in table 7.16). The sum of the four frequencies is equal to ¼. The four frequencies
when divided by the sum give the probabilities of the complete genotypes
(equation 7.34), by which the missing genotype XX can be imputed to AC, AD, BC, or BD.

For another example, when one individual has the incomplete genotype XC at
locus q of Category II, and the complete genotype AC at locus 1, frequencies of the
two possible complete genotypes AC and BC for XC at locus q are equal to ¼(1 − r)²
and ¼r(1 − r) (see the first row in table 7.16). The sum of the two frequencies is
equal to ¼(1 − r). The two frequencies when divided by the sum give the probabilities
of the two complete genotypes (equation 7.35), by which the incomplete
genotype XC can be imputed to AC or BC.


P{AC|XX} : P{AD|XX} : P{BC|XX} : P{BD|XX} = (1 − r)² : r(1 − r) : r(1 − r) : r² (completely missing) (7.34)
P{AC|XC} : P{BC|XC} = 1 − r : r (Category II markers) (7.35)
P{AD|XD} : P{BD|XD} = 1 − r : r (Category II markers) (7.36)
P{AC|AX} : P{AD|AX} = 1 − r : r (Category III markers) (7.37)
P{BC|BX} : P{BD|BX} = 1 − r : r (Category III markers) (7.38)
P{AD|AB} : P{BC|AB} = 1 : 1 (Category IV markers) (7.39)
P{AC|AB} : P{BD|AB} = (1 − r)² : r² (Category V markers) (7.40)
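
The probability ratios in equations 7.34–7.40 can be reproduced numerically by weighting each candidate genotype at locus q by (1 − r) for every parental allele it shares with the linked marker and by r otherwise, and then normalizing. The sketch below assumes two-letter genotype codes in which the first letter is the female allele (A or B) and the second the male allele (C or D); the function name `conditional_probabilities` and the value r = 0.1 are illustrative.

```python
def conditional_probabilities(linked_genotype, options, r):
    """Probabilities of the candidate complete genotypes at locus q, given the
    complete genotype at one linked marker and the combined recombination
    frequency r (a normalized row of table 7.16 restricted to the candidates)."""
    weights = []
    for genotype in options:
        # First letter: female allele (A or B); second letter: male allele (C or D).
        w_female = 1.0 - r if genotype[0] == linked_genotype[0] else r
        w_male = 1.0 - r if genotype[1] == linked_genotype[1] else r
        weights.append(w_female * w_male)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all candidate genotypes have zero probability")
    return [w / total for w in weights]


r = 0.1
p_xx = conditional_probabilities("AC", ["AC", "AD", "BC", "BD"], r)  # equation 7.34
p_xc = conditional_probabilities("AC", ["AC", "BC"], r)              # equation 7.35
p_iv_ab = conditional_probabilities("AC", ["AD", "BC"], r)           # equation 7.39
p_v_ab = conditional_probabilities("AC", ["AC", "BD"], r)            # equation 7.40
```

The same function applies to the other three genotypes at locus 1 by passing "AD", "BC", or "BD" as the first argument, which corresponds to reading a different row of table 7.16.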


3. Two fully-informative flanking markers can be utilized 

The current marker belongs to this situation if it is located in the middle of the 
linkage map, for example. The two fully-informative flanking markers are treated as 
locus 1 and locus 2, and the marker to be imputed is treated as locus q which is 
located between locus 1 and locus 2 on the linkage map. Under the condition of each 
of the 16 complete genotypes at the two flanking loci, theoretical frequencies of the 
four genotypes at locus q (the locus to be imputed) can be calculated from the joint 
frequencies given in table 7.14. In this situation, genotypic frequencies at locus q 
depend on the joint genotypes at locus 1 and locus 2. Probability ratios as given in 
equations 7.41-7.47 are acquired from the theoretical frequencies corresponding to 
genotype A1C1A2C2 at locus 1 and locus 2 as given in table 7.14.
For example, when one individual has the incomplete genotype XX at locus q, and
the complete genotype AC at both locus 1 and locus 2, i.e., A1C1A2C2, frequencies of
the four possible complete genotypes at locus q are equal to ¼(1 − r_1)²(1 − r_2)²,
¼r_1(1 − r_1)r_2(1 − r_2), ¼r_1(1 − r_1)r_2(1 − r_2), and ¼r_1²r_2² (see the first row in table 7.14).
The sum of the four frequencies is equal to ¼(1 − r)², which is the theoretical
frequency of the joint genotype at locus 1 and locus 2. The four frequencies when
divided by the sum give the probabilities of the complete genotypes (equation 7.41),
by which the missing genotype XX can be imputed to AC, AD, BC, or BD.

For another example, when one individual has the incomplete genotype XC at
locus q of Category II, and the complete genotype is A1C1A2C2 at locus 1 and locus
2, frequencies of the two possible complete genotypes AC and BC for XC at locus q
are equal to ¼(1 − r_1)²(1 − r_2)² and ¼r_1(1 − r_1)r_2(1 − r_2) (see the first row in
table 7.14). Therefore, the probability ratio can be acquired as given in
equation 7.42, by which the incomplete genotype XC can be imputed to AC or BC.


P{AC|XX} : P{AD|XX} : P{BC|XX} : P{BD|XX}
= (1 − r_1)²(1 − r_2)² : r_1(1 − r_1)r_2(1 − r_2) : r_1(1 − r_1)r_2(1 − r_2) : r_1²r_2² (completely missing) (7.41)
P{AC|XC} : P{BC|XC} = (1 − r_1)(1 − r_2) : r_1r_2 (Category II markers) (7.42)
P{AD|XD} : P{BD|XD} = (1 − r_1)(1 − r_2) : r_1r_2 (Category II markers) (7.43)
P{AC|AX} : P{AD|AX} = (1 − r_1)(1 − r_2) : r_1r_2 (Category III markers) (7.44)
P{BC|BX} : P{BD|BX} = (1 − r_1)(1 − r_2) : r_1r_2 (Category III markers) (7.45)
P{AD|AB} : P{BC|AB} = 1 : 1 (Category IV markers) (7.46)
P{AC|AB} : P{BD|AB} = (1 − r_1)²(1 − r_2)² : r_1²r_2² (Category V markers) (7.47)
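
With two flanking markers, each candidate genotype at locus q can be weighted by one factor per parent and per interval: (1 − r_1) or r_1 depending on whether the allele agrees with the left marker, and (1 − r_2) or r_2 depending on whether it agrees with the right marker. The sketch below assumes independent crossovers in the two intervals, as in table 7.14, and two-letter genotype codes with the female allele first. The function name `flanking_probabilities` and the values r_1 = 0.1 and r_2 = 0.2 are illustrative.

```python
def flanking_probabilities(left, right, options, r1, r2):
    """Probabilities of the candidate complete genotypes at locus q, given the
    complete genotypes at the left and right flanking markers."""
    weights = []
    for genotype in options:
        weight = 1.0
        for pos in (0, 1):  # 0: female allele (A or B), 1: male allele (C or D)
            weight *= (1.0 - r1) if genotype[pos] == left[pos] else r1
            weight *= (1.0 - r2) if genotype[pos] == right[pos] else r2
        weights.append(weight)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all candidate genotypes have zero probability")
    return [w / total for w in weights]


r1, r2 = 0.1, 0.2
# Flanking genotype A1C1A2C2, as in equations 7.41, 7.42, and 7.47.
p_xx = flanking_probabilities("AC", "AC", ["AC", "AD", "BC", "BD"], r1, r2)
p_xc = flanking_probabilities("AC", "AC", ["AC", "BC"], r1, r2)
p_v_ab = flanking_probabilities("AC", "AC", ["AC", "BD"], r1, r2)
```

Passing other flanking genotypes, for example "AC" on the left and "BD" on the right, gives the probabilities for the remaining rows of table 7.14 without writing out each ratio by hand.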


7.4 QTL Mapping in the Double Cross F1 Population
Derived from Four Pure-Line Parents


As has been shown in §7.1 and §7.2, the unknown linkage phases in
heterozygous parents can be identified by the linkage analysis in their F1 progenies.
Once the parental haploid types are rebuilt, the hybrid F1 from two heterozygous
parents can be treated as one double cross F1 from four virtual inbred parents. Based
on the constructed linkage map, incomplete and missing marker types in the double
cross F1 population can be imputed and converted to the fully-informative marker
types, and therefore all markers belong to Category I after imputation. All these
tasks can be completed by the integrated software GACD (Zhang et al., 2015c).
Without loss of generality, only the double cross F1 population is considered in
QTL mapping in this section, under the assumption that all markers belong to
Category I and there are no missing marker types. Readers can also refer to
Zhang et al. (2015b) for more details not covered in this section.


7.4.1 One-QTL Genetic Model in Double Cross
F1 Population


Consider the one-QTL model first, and assume $A_q$, $B_q$, $C_q$, and $D_q$ are the four alleles
at the QTL. The genotypic value of an individual with a known QTL genotype, i.e.,
$A_qC_q$, $A_qD_q$, $B_qC_q$, or $B_qD_q$, is written in equation 7.48.


$$G = \mu + au + bv + duv \tag{7.48}$$


where $\mu$ is the overall mean of the four QTL genotypes for the phenotypic trait of
interest; $u$ and $v$ are two orthogonal indicators of the QTL genotype, valued at 1 and 1
for $A_qC_q$, 1 and -1 for $A_qD_q$, -1 and 1 for $B_qC_q$, and -1 and -1 for $B_qD_q$; $a$ is the
additive genetic effect of the female parent, measuring the difference between the
two female-parent alleles A and B; $b$ is the additive genetic effect of the male parent,
measuring the difference between the two male-parent alleles C and D; and $d$ is the
dominant effect between the female and male parents, or the intra-genic interaction
between female and male alleles.

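Let $\mu_i$ ($i$ = 1, 2, 3, 4) be the phenotypic means (or genotypic values) of the four
QTL genotypes, taken in the order $A_qC_q$, $A_qD_q$, $B_qC_q$, and $B_qD_q$. From equation 7.48,
the overall mean and three defined genetic effects can be calculated, as shown in equation 7.49.


$$\mu = \tfrac{1}{4}(\mu_1 + \mu_2 + \mu_3 + \mu_4), \quad a = \tfrac{1}{4}(\mu_1 + \mu_2 - \mu_3 - \mu_4), \quad b = \tfrac{1}{4}(\mu_1 - \mu_2 + \mu_3 - \mu_4), \quad d = \tfrac{1}{4}(\mu_1 - \mu_2 - \mu_3 + \mu_4) \tag{7.49}$$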

When there is no segregation distortion, frequencies of the four QTL genotypes
$A_qC_q$, $A_qD_q$, $B_qC_q$, and $B_qD_q$ are all equal to 0.25 in the progenies, and the
relationship between the genetic variance contributed by the QTL and its genetic effects can
be derived, as given in equation 7.50.


$$V_G = \tfrac{1}{4}(\mu_1^2 + \mu_2^2 + \mu_3^2 + \mu_4^2) - \tfrac{1}{16}(\mu_1 + \mu_2 + \mu_3 + \mu_4)^2 = a^2 + b^2 + d^2 \tag{7.50}$$
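
Equations 7.48 to 7.50 can be checked numerically. The short Python sketch below is only an illustration under the assumptions stated in the text (four equally frequent QTL genotypes, indicator coding as defined for equation 7.48). The function names and the numerical values in the example are hypothetical.

```python
# Orthogonal indicators (u, v) of the four QTL genotypes, as defined for equation 7.48.
INDICATORS = {"AC": (1, 1), "AD": (1, -1), "BC": (-1, 1), "BD": (-1, -1)}


def genotypic_values(mu, a, b, d):
    """Equation 7.48: G = mu + a*u + b*v + d*u*v for each QTL genotype."""
    return {g: mu + a * u + b * v + d * u * v for g, (u, v) in INDICATORS.items()}


def effects_from_means(means):
    """Equation 7.49: overall mean and effects a, b, d from the four genotype means."""
    m1, m2, m3, m4 = (means[g] for g in ("AC", "AD", "BC", "BD"))
    mu = (m1 + m2 + m3 + m4) / 4
    a = (m1 + m2 - m3 - m4) / 4
    b = (m1 - m2 + m3 - m4) / 4
    d = (m1 - m2 - m3 + m4) / 4
    return mu, a, b, d


def genetic_variance(means):
    """Left-hand form of equation 7.50 (genotype frequencies all equal to 0.25)."""
    vals = [means[g] for g in ("AC", "AD", "BC", "BD")]
    return sum(v * v for v in vals) / 4 - (sum(vals) / 4) ** 2


if __name__ == "__main__":
    means = genotypic_values(mu=10.0, a=2.0, b=1.0, d=0.5)  # hypothetical effects
    print(means)
    print(effects_from_means(means))
    print(genetic_variance(means))  # equals a^2 + b^2 + d^2
```

Because the three indicator variables u, v, and uv are mutually orthogonal with unit variance when the four genotypes are equally frequent, the variance computed from the genotype means equals the sum of the squared effects.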


Assuming that $A_1$, $B_1$, $C_1$, $D_1$ and $A_2$, $B_2$, $C_2$, $D_2$ are the four alleles at two
complete markers flanking the QTL, there are a total of 16 identifiable marker
classes in the double cross population. Let $r_1$, $r_2$, and $r$ be the combined
recombination frequencies between the left marker and QTL, between the QTL and right
marker, and between the two flanking markers, respectively. Similar to the QTL
genotypic indicators $u$ and $v$ as defined in equation 7.48, indicators are also defined
for the two flanking markers, which are represented by $x_1$ and $y_1$ for the left marker,
and $x_2$ and $y_2$ for the right marker. Shown in table 7.17 are the values of marker type
indicators, expectations of the QTL indicators $u$ and $v$ together with their product
$uv$, and phenotypic means of the 16 marker classes, where $f_1$ and $f_2$ are functions
depending only on the three recombination frequencies as given in equation 7.51.


$$f_1 = 1 - \frac{2 r_1 r_2}{1 - r}, \quad f_2 = \frac{r_1 - r_2}{r} \tag{7.51}$$


Two other variables, i.e., $g_1$ and $g_2$, are defined from $f_1$ and $f_2$, i.e.,
equation 7.52, which also depend only on the three recombination frequencies.
It can be shown that under each marker class, expectations of the two QTL
indicators $u$ and $v$, and their product $uv$, can be represented by particular linear
combinations of the indicators of the two flanking markers, which are given in
equations 7.53-7.55, respectively.


$$g_1 = \tfrac{1}{2}(f_1 - f_2), \quad g_2 = \tfrac{1}{2}(f_1 + f_2) \tag{7.52}$$

$$E(u \mid x_1, y_1, x_2, y_2) = g_1 x_1 + g_2 x_2 \tag{7.53}$$

$$E(v \mid x_1, y_1, x_2, y_2) = g_1 y_1 + g_2 y_2 \tag{7.54}$$

$$E(uv \mid x_1, y_1, x_2, y_2) = g_1^2 x_1 y_1 + g_2^2 x_2 y_2 + g_1 g_2 (x_1 y_2 + x_2 y_1) \tag{7.55}$$


Therefore, the phenotypic mean under each marker class can be represented by a
linear combination of marker-type indicators, i.e., equation 7.56. If the coefficients
before marker type indicators in equation 7.56 are viewed as the effects of markers,
additive effects $a$ and $b$ at the QTL can cause the additive effects on flanking
markers, i.e., the coefficients before variables $x_1$, $y_1$, $x_2$, and $y_2$; dominant effect $d$ at
the QTL can cause the dominant effects of flanking markers, i.e., coefficients before
product variables $x_1y_1$ and $x_2y_2$, and the interaction effects between the two flanking
markers as well, i.e., the coefficient before variable $(x_1y_2 + x_2y_1)$. The interaction
between two flanking markers as shown in equation 7.56 is similar to the
epistatic effects between markers caused by the dominant effect of QTL in
bi-parental F2 populations, as introduced in chapter 5.


$$E(G \mid x_1, y_1, x_2, y_2) = \mu + (a g_1) x_1 + (b g_1) y_1 + (d g_1^2) x_1 y_1 + (a g_2) x_2 + (b g_2) y_2 + (d g_2^2) x_2 y_2 + (d g_1 g_2)(x_1 y_2 + x_2 y_1) \tag{7.56}$$
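
The linear representation in equations 7.51 to 7.56 can be verified by direct enumeration. The Python sketch below is an illustration only. It assumes no crossover interference, so that the recombination frequency between the flanking markers is r = r1 + r2 - 2 r1 r2 (this relation is an assumption of the sketch), and it assumes the same recombination frequencies in both parents. All function names are hypothetical.

```python
from itertools import product


def f_and_g(r1, r2):
    """Equations 7.51 and 7.52 (requires 0 < r < 1)."""
    r = r1 + r2 - 2 * r1 * r2          # assumed: no interference
    f1 = 1 - 2 * r1 * r2 / (1 - r)
    f2 = (r1 - r2) / r
    return f1, f2, (f1 - f2) / 2, (f1 + f2) / 2


def expected_indicator(left, right, r1, r2):
    """E(u | x1, x2) for one parent by enumerating its two QTL alleles.
    left and right are the marker indicators (+1 or -1) of that parent."""
    def prob(allele):
        p_left = (1 - r1) if allele == left else r1
        p_right = (1 - r2) if allele == right else r2
        return p_left * p_right
    return (prob(1) - prob(-1)) / (prob(1) + prob(-1))


def class_mean(mu, a, b, d, x1, y1, x2, y2, r1, r2):
    """Equation 7.56: phenotypic mean of one marker class."""
    _, _, g1, g2 = f_and_g(r1, r2)
    return (mu + a * g1 * x1 + b * g1 * y1 + d * g1 * g1 * x1 * y1
            + a * g2 * x2 + b * g2 * y2 + d * g2 * g2 * x2 * y2
            + d * g1 * g2 * (x1 * y2 + x2 * y1))


if __name__ == "__main__":
    r1, r2 = 0.05, 0.15                      # hypothetical values
    for x1, y1, x2, y2 in product([1, -1], repeat=4):
        eu = expected_indicator(x1, x2, r1, r2)
        ev = expected_indicator(y1, y2, r1, r2)
        direct = 50 + 5 * eu + 2 * ev + 1 * eu * ev
        print(x1, y1, x2, y2, round(direct, 4),
              round(class_mean(50, 5, 2, 1, x1, y1, x2, y2, r1, r2), 4))
```

The two printed means agree for all 16 marker classes, because the female and male meioses are independent and therefore E(uv) = E(u)E(v) within each marker class.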


In equation 7.56, let symbols $\alpha_1$, $\beta_1$, and $\delta_1$ represent the QTL effects on the left
marker, $\alpha_2$, $\beta_2$, and $\delta_2$ represent the QTL effects on the right marker, and $\tau$ represent
the interaction effect between the two flanking markers. Thus a linear relationship
can be built between the phenotypic mean of each marker class and the indicators of
marker type corresponding to the marker class, i.e., equation 7.57.


TAB. 7.17 – Indicators of two flanking markers, expectations of QTL indicators, and phenotypic mean of each of the 16 marker classes in the
double cross F1 population.

Left     Right    Marker type indicators   E(u|x1,y1,x2,y2)   E(v|x1,y1,x2,y2)   E(uv|x1,y1,x2,y2)   Phenotypic mean
marker   marker   x1    y1    x2    y2
A1C1     A2C2      1     1     1     1     f1                 f1                 f1^2                μ + f1·a + f1·b + f1^2·d
A1C1     A2D2      1     1     1    -1     f1                 -f2                -f1·f2              μ + f1·a - f2·b - f1·f2·d
A1D1     A2C2      1    -1     1     1     f1                 f2                 f1·f2               μ + f1·a + f2·b + f1·f2·d
A1D1     A2D2      1    -1     1    -1     f1                 -f1                -f1^2               μ + f1·a - f1·b - f1^2·d
A1C1     B2C2      1     1    -1     1     -f2                f1                 -f1·f2              μ - f2·a + f1·b - f1·f2·d
A1C1     B2D2      1     1    -1    -1     -f2                -f2                f2^2                μ - f2·a - f2·b + f2^2·d
A1D1     B2C2      1    -1    -1     1     -f2                f2                 -f2^2               μ - f2·a + f2·b - f2^2·d
A1D1     B2D2      1    -1    -1    -1     -f2                -f1                f1·f2               μ - f2·a - f1·b + f1·f2·d
B1C1     A2C2     -1     1     1     1     f2                 f1                 f1·f2               μ + f2·a + f1·b + f1·f2·d
B1C1     A2D2     -1     1     1    -1     f2                 -f2                -f2^2               μ + f2·a - f2·b - f2^2·d
B1D1     A2C2     -1    -1     1     1     f2                 f2                 f2^2                μ + f2·a + f2·b + f2^2·d
B1D1     A2D2     -1    -1     1    -1     f2                 -f1                -f1·f2              μ + f2·a - f1·b - f1·f2·d
B1C1     B2C2     -1     1    -1     1     -f1                f1                 -f1^2               μ - f1·a + f1·b - f1^2·d
B1C1     B2D2     -1     1    -1    -1     -f1                -f2                f1·f2               μ - f1·a - f2·b + f1·f2·d
B1D1     B2C2     -1    -1    -1     1     -f1                f2                 -f1·f2              μ - f1·a + f2·b - f1·f2·d
B1D1     B2D2     -1    -1    -1    -1     -f1                -f1                f1^2                μ - f1·a - f1·b + f1^2·d

Note: r1, r2, and r are the combined recombination frequencies between the left marker and QTL, between QTL and the right marker, and between
the two flanking markers, respectively; f1 = 1 - 2·r1·r2/(1 - r) and f2 = (r1 - r2)/r, as given in equation 7.51.


$$E(G \mid x_1, y_1, x_2, y_2) = \mu + \alpha_1 x_1 + \beta_1 y_1 + \delta_1 x_1 y_1 + \alpha_2 x_2 + \beta_2 y_2 + \delta_2 x_2 y_2 + \tau (x_1 y_2 + x_2 y_1) \tag{7.57}$$


7.4.2 The Linear Regression Model of the Phenotype 
on Marker Type for Multiple QTLs 


For convenience, assume there are $m$ QTLs located on $m$ intervals
defined by $m + 1$ markers on one chromosome. Ignoring the overall mean in
equation 7.48, genotypic values from one QTL can be represented by equation 7.58,
where $u_j$ and $v_j$ are the indicators of genotypes at the $j$th QTL; and $a_j$, $b_j$, and $d_j$ are
the genetic effects of the $j$th QTL.


$$G_j = a_j u_j + b_j v_j + d_j u_j v_j, \quad \text{where } j = 1, 2, \ldots, m \tag{7.58}$$


From equation 7.57, the phenotypic mean of the $j$th QTL under each marker class
can be represented by equation 7.59, where $\alpha_{j1}$, $\beta_{j1}$, and $\delta_{j1}$ are the $j$th QTL effects on
the left flanking marker; $\alpha_{j2}$, $\beta_{j2}$, and $\delta_{j2}$ are the $j$th QTL effects on the right flanking
marker; and $\tau_j$ is an interaction effect between the two flanking markers.


$$E(G_j \mid x_j, y_j, x_{j+1}, y_{j+1}) = \alpha_{j1} x_j + \beta_{j1} y_j + \delta_{j1} x_j y_j + \alpha_{j2} x_{j+1} + \beta_{j2} y_{j+1} + \delta_{j2} x_{j+1} y_{j+1} + \tau_j (x_j y_{j+1} + x_{j+1} y_j) \tag{7.59}$$


If there is no QTL on the jth marker interval, the three QTL effects as given in 
equation 7.58 are all equal to 0, and so are the seven marker effects as given in 
equation 7.59. When considering the multiple QTLs together and assuming the 
additivity of genetic effects from different QTL genotypes, the genotypic value of an 
individual in the double cross progenies can be represented by equation 7.60. 


$$G = \mu + \sum_{j=1}^{m} G_j = \mu + \sum_{j=1}^{m} \left[ a_j u_j + b_j v_j + d_j u_j v_j \right] \tag{7.60}$$

From equation 7.59, the expectation of genotypic value under each marker class 
can be acquired and given in equation 7.61. 


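$$E(G) = \mu + \sum_{j=1}^{m+1} (A_j x_j + B_j y_j) + \sum_{j=1}^{m+1} D_j x_j y_j + \sum_{j=1}^{m} T_j (x_j y_{j+1} + x_{j+1} y_j) \tag{7.61}$$

where,
$A_1 = \alpha_{11}$, $B_1 = \beta_{11}$, $D_1 = \delta_{11}$;
$A_j = \alpha_{j-1,2} + \alpha_{j1}$, $B_j = \beta_{j-1,2} + \beta_{j1}$, $D_j = \delta_{j-1,2} + \delta_{j1}$ ($j = 2, 3, \ldots, m$);
$A_{m+1} = \alpha_{m2}$, $B_{m+1} = \beta_{m2}$, $D_{m+1} = \delta_{m2}$;
$T_j = \tau_j$ ($j = 1, 2, \ldots, m$).

Therefore, the inclusive linear model of phenotype on marker type can be given
in equation 7.62.


$$P = \mu + \sum_{j=1}^{m+1} (A_j x_j + B_j y_j) + \sum_{j=1}^{m+1} D_j x_j y_j + \sum_{j=1}^{m} T_j (x_j y_{j+1} + x_{j+1} y_j) + \varepsilon \tag{7.62}$$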


where $P$ is the phenotypic value of an individual and $\varepsilon$ is the random
environmental error. It can be seen from the deduction of equation 7.61 that the
coefficients of marker variables in equation 7.62 are only affected by QTLs located
in the neighboring marker intervals. In other words, the genetic effects of one
QTL can be completely absorbed by variables of the two most closely linked
markers. Therefore, if the linear model defined by equation 7.62 can be precisely
estimated, the effects of all QTLs would be included in the coefficients of the
linear model. Based on this property, the background genetic variation can be
well controlled in the QTL interval mapping. In the integrated software GACD
(Zhang et al., 2015c), stepwise regression is used to estimate the marker
coefficients in equation 7.62. The coefficients of those variables not selected by
stepwise regression are set at 0.
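
To make the structure of equations 7.59 to 7.62 concrete, the following Python sketch assembles the marker coefficients from per-interval QTL effects and builds the regressors of the inclusive linear model for one individual. It is a minimal illustration for a single chromosome, not the GACD implementation. The function names, the tuple layout `(a, b, d, g1, g2)`, and the example values are assumptions of the sketch, and the stepwise regression step itself is not shown.

```python
def marker_coefficients(qtl):
    """qtl: list of m tuples (a, b, d, g1, g2), one per marker interval;
    use a = b = d = 0 for an interval without a QTL.
    Returns A, B, D (length m + 1) and T (length m) as defined below equation 7.61."""
    m = len(qtl)
    A = [0.0] * (m + 1)
    B = [0.0] * (m + 1)
    D = [0.0] * (m + 1)
    T = [0.0] * m
    for j, (a, b, d, g1, g2) in enumerate(qtl):
        # effects of the QTL on its left flanking marker (alpha_j1, beta_j1, delta_j1)
        A[j] += a * g1
        B[j] += b * g1
        D[j] += d * g1 * g1
        # effects of the QTL on its right flanking marker (alpha_j2, beta_j2, delta_j2)
        A[j + 1] += a * g2
        B[j + 1] += b * g2
        D[j + 1] += d * g2 * g2
        # interaction between the two flanking markers (tau_j)
        T[j] = d * g1 * g2
    return A, B, D, T


def design_row(x, y):
    """Regressors of equation 7.62 (without the intercept) for one individual.
    x and y are the lists of indicators at the m + 1 markers."""
    row = []
    for xj, yj in zip(x, y):
        row += [xj, yj, xj * yj]
    for j in range(len(x) - 1):
        row.append(x[j] * y[j + 1] + x[j + 1] * y[j])
    return row


def expected_value(mu, A, B, D, T, x, y):
    """Equation 7.61 evaluated for one marker-type configuration."""
    coef = []
    for Aj, Bj, Dj in zip(A, B, D):
        coef += [Aj, Bj, Dj]
    coef += T
    return mu + sum(c * z for c, z in zip(coef, design_row(x, y)))


if __name__ == "__main__":
    # Hypothetical example: three markers, one QTL in the first interval only.
    A, B, D, T = marker_coefficients([(2.0, 1.0, 0.5, 0.6, 0.3), (0.0, 0.0, 0.0, 0.5, 0.4)])
    print(A, B, D, T)
    print(expected_value(10.0, A, B, D, T, [1, 1, 1], [1, 1, 1]))
```

In the example the third marker receives zero coefficients, which reflects the property stated above: a QTL only contributes to the variables of its two flanking markers.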


7.4.3 Inclusive Composite Interval Mapping (ICIM)
in the Double Cross F1 Population


Assume there are a number of n progenies in the double cross population that have 
been both genotyped and phenotyped. The genome-wide scanning is conducted on 
the adjusted phenotypic values as given in equation 7.63 in order to exclude the 
influence of background genetic variation. 


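$$\Delta P_i = P_i - \sum_{j \ne k,\, k+1} \left( \hat{A}_j x_{ij} + \hat{B}_j y_{ij} \right) - \sum_{j \ne k,\, k+1} \hat{D}_j x_{ij} y_{ij} - \sum_{j \ne k} \hat{T}_j \left( x_{ij} y_{i,j+1} + x_{i,j+1} y_{ij} \right) \tag{7.63}$$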


where $k$ and $k + 1$ represent the two flanking markers of the current scanning
position; $i$ = 1, 2, ..., $n$ represents the $n$ progenies in the population; the hat symbol
means the estimated value of a regression coefficient; $x_{ij}$ and $y_{ij}$ are genotypic
indicators of the $i$th progeny at the $j$th marker. When there is at least one blank
marker interval between two linked QTLs, the adjusted phenotype $\Delta P_i$ as given in
equation 7.63 only contains the location and effect information of the QTL in the
current interval. QTLs in other intervals and on other chromosomes have been
completely controlled by the adjustment, and will not have any influence when
$\Delta P_i$ is used to test the existence of a QTL and estimate the genetic effects of the QTL at
the current position.
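
The adjustment in equation 7.63 can be sketched in a few lines of Python. The function below is a hypothetical illustration for one progeny and one chromosome. It assumes that the coefficient estimates A, B, D, and T have already been obtained from the inclusive linear model, and it uses a 0-based index k for the left flanking marker of the current interval.

```python
def adjusted_phenotype(P, x, y, A, B, D, T, k):
    """Adjusted phenotype of equation 7.63 for one progeny.
    P: observed phenotype; x, y: marker indicators at the m + 1 markers;
    A, B, D: estimated marker coefficients (length m + 1); T: estimated
    interaction coefficients (length m); k: 0-based index of the left
    flanking marker of the current scanning interval."""
    adj = P
    for j in range(len(x)):
        if j in (k, k + 1):
            continue  # keep the effects of the two flanking markers
        adj -= A[j] * x[j] + B[j] * y[j] + D[j] * x[j] * y[j]
    for j in range(len(x) - 1):
        if j == k:
            continue  # keep the interaction of the current interval
        adj -= T[j] * (x[j] * y[j + 1] + x[j + 1] * y[j])
    return adj


if __name__ == "__main__":
    # Hypothetical values: three markers, scanning the first interval (k = 0).
    print(adjusted_phenotype(10.0, [1, -1, 1], [1, 1, -1],
                             [1.0, 2.0, 3.0], [0.5, 0.5, 0.5],
                             [0.1, 0.2, 0.3], [0.4, 0.6], k=0))
```

Only the terms of markers outside the current interval are subtracted, so the variation explained by the two flanking markers stays in the adjusted phenotype and is available to the interval test.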
At the current scanning position, phenotypic observations of the four QTL
genotypes $A_qC_q$, $A_qD_q$, $B_qC_q$, and $B_qD_q$ are assumed to be normally distributed as
$N(\mu_k, \sigma^2)$ ($k$ = 1, 2, 3, 4). The two hypotheses used to test the existence of a QTL
are,

$H_0$: $\mu_1 = \mu_2 = \mu_3 = \mu_4$, and
$H_A$: at least two among $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ are not equal (i.e., no restriction).

Under $H_0$, phenotypic observations of the four QTL genotypes follow the same
normal distribution. Estimates of the mean and variance of the distribution are
equal to the sample mean and sample variance of the observed progenies in the
population. The logarithm likelihood under $H_A$ is,


$$L_A = \sum_{j=1}^{16} \sum_{i \in S_j} \ln \left[ \sum_{k=1}^{4} \pi_{jk} \, f(\Delta P_i; \mu_k, \sigma^2) \right] \tag{7.64}$$


where $S_j$ represents the progenies belonging to the $j$th marker class ($j$ = 1, 2, ..., 16);
$\pi_{jk}$ ($k$ = 1, 2, 3, 4) is the conditional probability of the $k$th QTL genotype in the $j$th
marker class, which can be calculated by dividing the frequencies in the $k$th column
in table 7.14 by the marginal theoretical frequencies of the $j$th row or the $j$th marker
class; and $f(\cdot; \mu_k, \sigma^2)$ is the density function of the normal distribution $N(\mu_k, \sigma^2)$.

The EM algorithm for calculating the maximum likelihood estimates in
equation 7.64 can be found in Zhang et al. (2015b), and will not be described
here. When the estimates are acquired for parameters under the two hypotheses,
the LRT statistic or LOD score at the current scanning position can be acquired
accordingly. Under $H_0$, the four QTL genotypes follow the same distribution with
one mean and one variance. Namely, two parameters need to be estimated.
Under $H_A$, the four QTL genotypes follow different distributions, but their
variances are all caused by random errors, which can be assumed to be the same.
Namely, five parameters need to be estimated, i.e., four means and one variance.
When the progeny population is large enough, the LRT statistic calculated from
the two hypotheses approaches a $\chi^2$ distribution with 3 degrees of freedom. Power
analysis of ICIM in the double cross population, and systematic comparisons
with other mapping methods, can also be found in Zhang et al. (2015b).
Applications of ICIM in a maize double cross population with unknown linkage
phases can be found in Ding et al. (2015) and Chen et al. (2016) for a number of
phenotypic traits.
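
For orientation, the sketch below shows how the likelihood in equation 7.64 and the resulting LRT and LOD statistics could be computed in Python. The EM routine is a generic textbook EM for a four-component normal mixture with known mixing proportions and a common variance. It is an assumption of this sketch and is not claimed to be the algorithm of Zhang et al. (2015b). All function names are hypothetical, and the input `pi` is assumed to hold, for each progeny, the four conditional QTL genotype probabilities of its marker class.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def loglik_h0(dp):
    """Log-likelihood under H0: one normal distribution, ML mean and variance."""
    n = len(dp)
    mean = sum(dp) / n
    var = sum((v - mean) ** 2 for v in dp) / n  # ML estimate, divides by n
    return sum(math.log(normal_pdf(v, mean, var)) for v in dp)


def loglik_ha(dp, pi, mu, var):
    """Equation 7.64, with pi[i][k] the conditional probability of QTL
    genotype k given the marker class of progeny i."""
    return sum(
        math.log(sum(p * normal_pdf(v, m, var) for p, m in zip(pi_i, mu)))
        for v, pi_i in zip(dp, pi)
    )


def em_fit(dp, pi, n_iter=100):
    """Generic EM: four means, one common variance, fixed mixing proportions."""
    n = len(dp)
    mean = sum(dp) / n
    mu = [mean] * 4
    var = sum((v - mean) ** 2 for v in dp) / n
    for _ in range(n_iter):
        # E-step: posterior probability of each QTL genotype for each progeny
        post = []
        for v, pi_i in zip(dp, pi):
            w = [p * normal_pdf(v, m, var) for p, m in zip(pi_i, mu)]
            s = sum(w)
            post.append([wk / s for wk in w])
        # M-step: posterior-weighted means and pooled variance
        for k in range(4):
            wk = sum(row[k] for row in post)
            if wk > 0:
                mu[k] = sum(row[k] * v for row, v in zip(post, dp)) / wk
        var = sum(row[k] * (v - mu[k]) ** 2
                  for row, v in zip(post, dp) for k in range(4)) / n
    return mu, var


def lod_score(dp, pi):
    """Returns (LOD, LRT), where LRT = 2 (L_A - L_0) and LOD = LRT / (2 ln 10)."""
    mu, var = em_fit(dp, pi)
    lrt = 2 * (loglik_ha(dp, pi, mu, var) - loglik_h0(dp))
    return lrt / (2 * math.log(10)), lrt
```

The sketch uses a fixed number of iterations instead of a convergence test and does not guard against a zero variance, so it is meant for reading rather than for production use. The LRT value would be compared with a chi-square distribution with 3 degrees of freedom, or the LOD value with an empirical threshold.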

Next, one simulated double cross population is used as an example to
illustrate the mapping results from two methods, i.e., simple interval mapping
(IM) and ICIM. As in bi-parental populations (see chapter 4), IM is based on the
original phenotypic values in the one-dimensional scanning, and therefore both
the random errors and background genetic effects are included in the sampling
variance of QTL genotypic distributions. The size of the simulated population is
200. The genome consists of 8 chromosomes, each with 15 markers. Profiles of the
LOD score from the two mapping methods are shown in figure 7.8. Peaks
on the LOD profile from ICIM are clearer, sharper, and higher than those
from IM. There are six peaks higher than the threshold 3.0 in the LOD profile from
ICIM. Detailed information on estimations at the six peaks can be found in
table 7.18. There are five common QTLs detected by both methods, located on
chromosomes 2-6, and the estimated QTL positions and effects are similar
between the two methods. IM does not detect the one on chromosome 1, which has the
smallest effect among the six pre-defined QTLs.


FIG. 7.8 – LOD profiles from IM and ICIM in one double cross F1 population. Horizontal axis: chromosome number of the scanning positions (chromosomes 1-8); vertical axis: LOD score.


In the double cross F1 population, each individual is heterozygous. The three
genetic effects $a$, $b$, and $d$ as defined in equation 7.48 measure the relative effects
between alleles, and do not make it easy to compare the performance of the four QTL
genotypes. The last four columns in table 7.18 therefore also give the genotypic values (or
phenotypic means) of the four genotypes at each detected QTL (see equation 7.49),
by which the phenotypic difference among the four QTL genotypes can be easily
seen.

For example, let Q1–Q6 denote the six QTLs detected by ICIM. When the higher
phenotypic value is favored, AC is the best genotype at loci Q1 and Q2, BC is the
best genotype at locus Q3, BD is the best genotype at locus Q4, and AD is the best
genotype at loci Q5 and Q6. Ignoring the epistasis between loci, the combination of
genotypes AC at Q1 and Q2, BC at Q3, BD at Q4, and AD at Q5 and Q6 will make
one individual with the best performance. It should be the target genotype in
breeding, which can be selected by the markers most closely linked with the six
detected QTLs.
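
The choice of the target genotype described above amounts to taking, at each QTL, the genotype with the largest genotypic value. The Python sketch below reproduces that choice from the genotypic values of the six QTLs detected by ICIM in table 7.18. The dictionary layout and the function name are hypothetical, and epistasis between loci is ignored, as in the text.

```python
# Genotypic values of the six QTLs detected by ICIM (table 7.18).
ICIM_GENOTYPIC_VALUES = {
    "Q1": {"AC": 30.31, "AD": 28.94, "BC": 27.04, "BD": 25.23},
    "Q2": {"AC": 30.84, "AD": 24.26, "BC": 29.39, "BD": 27.70},
    "Q3": {"AC": 30.68, "AD": 23.93, "BC": 31.57, "BD": 25.61},
    "Q4": {"AC": 25.41, "AD": 28.27, "BC": 27.83, "BD": 30.12},
    "Q5": {"AC": 25.40, "AD": 30.90, "BC": 26.73, "BD": 28.79},
    "Q6": {"AC": 24.21, "AD": 30.86, "BC": 26.52, "BD": 30.32},
}


def target_genotype(values, higher_is_better=True):
    """Best genotype at each QTL, ignoring epistasis between loci."""
    pick = max if higher_is_better else min
    return {qtl: pick(gv, key=gv.get) for qtl, gv in values.items()}


if __name__ == "__main__":
    print(target_genotype(ICIM_GENOTYPIC_VALUES))
```

Setting `higher_is_better=False` gives the target genotype for traits where a lower phenotypic value is favored.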


TAB. 7.18 – Detailed information of the QTLs detected by IM and ICIM in one double cross F1 population.

Method  Chromosome  Position (cM)  LOD score  PVE (%)  Genetic effects           Genotypic values
                                                       a       b       d         AC      AD      BC      BD
IM      2           58             5.20       10.17    -0.40    2.27    1.27     30.68   23.60   28.93   26.93
IM      3           26             6.68       13.52    -0.69    2.97    0.45     30.64   23.81   31.11   26.08
IM      4           46             3.54        7.26    -2.00   -0.91    0.14     24.73   26.28   28.46   30.55
IM      5           21             3.06        6.20    -0.32   -1.98   -0.81     24.66   30.24   26.91   29.25
IM      6           57             4.00        8.24    -0.59   -2.32   -0.24     24.20   29.33   25.86   30.01
ICIM    1           34             3.27        4.84     1.75    0.80   -0.11     30.31*  28.94   27.04   25.23
ICIM    2           56             5.85        9.01    -0.50    2.07    1.22     30.84*  24.26   29.39   27.70
ICIM    3           25             9.27       15.19    -0.64    3.18    0.20     30.68   23.93   31.57*  25.61
ICIM    4           50             3.45        4.40    -1.07   -1.29   -0.14     25.41   28.27   27.83   30.12*
ICIM    5           20             4.30        5.88     0.20   -1.89   -0.86     25.40   30.90*  26.73   28.79
ICIM    6           55             6.91       10.51    -0.44   -2.61   -0.71     24.21   30.86*  26.52   30.32

Note: In genotypic values of QTLs detected by ICIM, the highest value among the four QTL genotypes is marked by an asterisk.


Exercises
7.1 For the fully-informative Scenario 1 in table 7.5, if alleles C and D at the second
locus are replaced with A and B, respectively, Scenario 4 is acquired; if alleles A and
B at the first locus are both replaced with X, and alleles C and D at the second
locus are replaced with A and B, respectively, Scenario 6 is acquired. The
replacement to have Scenarios 4 and 6 from Scenario 1 is given in the following
table.


Genotype  Genotype in Scenario 1     Genotype in Scenario 4  Genotype in Scenario 6
number    (two Category I markers)   (Categories I and IV)   (Categories II and IV)
1         AC_AC                      AC_AA                   XC_AA
2         AC_AD                      AC_AB                   XC_AB
3         AD_AC                      AD_AA                   XD_AA
4         AD_AD                      AD_AB                   XD_AB
5         AC_BC                      AC_BA                   XC_BA
6         AC_BD                      AC_BB                   XC_BB
7         AD_BC                      AD_BA                   XD_BA
8         AD_BD                      AD_BB                   XD_BB
9         BC_AC                      BC_AA                   XC_AA
10        BC_AD                      BC_AB                   XC_AB
11        BD_AC                      BD_AA                   XD_AA
12        BD_AD                      BD_AB                   XD_AB
13        BC_BC                      BC_BA                   XC_BA
14        BC_BD                      BC_BB                   XC_BB
15        BD_BC                      BD_BA                   XD_BA
16        BD_BD                      BD_BB                   XD_BB


(1) What are the identifiable genotypes in Scenarios 4 and 6?

(2) Which genotypes in Scenario 1 are included in each of the identifiable genotypes
in Scenarios 4 and 6?

(3) Based on (1) and (2), work out the theoretical frequencies of identifiable
genotypes in Scenarios 4 and 6 from the theoretical frequencies given in table 7.2.


7.2 In one hybrid F1 population derived from two heterozygous parents with size 100,
the following table gives the genotypic data at two complete markers M1 and M2.


Marker  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15

M1      AC  AD  AC  BD  AD  BC  AD  BC  BD  AC  BC  AD  AD  BC  AC
M2      BD  BC  BD  AC  BC  AD  BC  AC  AC  BC  AD  BC  BD  AD  BD
M1      BC  AC  BD  AC  BC  AD  BD  BD  BC  BD  AC  BD  BC  BC  BD
M2      AD  BD  AC  BD  AD  BC  AC  AC  AD  AC  BD  AC  AD  AD  AD
M1      BC  AD  BD  AD  AD  AD  AD  BC  AD  BC  BC  AC  BD  BD  BD
M2      AD  AC  AC  BC  BC  BC  BC  AD  AD  BD  AD  BC  AD  BD  AC
M1      AD  BD  AC  BD  BD  AD  AD  BC  BC  AC  BD  AD  BD  BD  AD
M2      AC  AC  BD  AC  AD  BC  BC  AC  AD  BD  AC  AC  AC  AC  BC
M1      AC  AD  AC  AC  BD  BD  AC  AC  AC  BD  AD  AD  BC  AC  AC
M2      BD  BC  BD  BC  AC  BC  BD  BD  BD  AC  BC  BC  AD  BC  BD
M1      BD  AD  AD  BD  AC  BD  BD  AC  AD  AD  BC  AC  BC  BC  AD
M2      AC  BC  BC  AD  BD  BC  AC  BD  BC  AC  AD  BD  AD  AC  BC
M1      BC  BC  AC  BD  AD  AC  AD  AD  AD  BC
M2      AD  AD  BC  AC  BC  BD  BC  BC  BC  AD

(1) Work out the observed sample sizes of four genotypes AC, AD, BC, and BD at
each of the two marker loci, and test if the four genotypes can be fitted by the
Mendelian ratio of 1:1:1:1.

(2) Considering the two marker loci jointly, work out the observed sample sizes of
16 genotypes.

(3) Use equations 7.2 and 7.3 to estimate the female and male recombination
frequencies between the two loci, respectively.

(4) Use equation 7.4 to estimate the combined recombination frequency, and
identify the linkage phases of the two loci in both parents.


7.3 In one hybrid F1 population derived from two heterozygous parents with a size of
200, the following table gives the observed sample sizes of the 16 genotypes at two
complete markers M1 and M2.


Marker M1 Marker M2

 AC AD BC BD
AC 7 35 1 4
AD 38 5 3
BC 0 3 10 34
BD 3 1 40 13


(1) Work out the observed sample sizes of four genotypes AC, AD, BC, and BD at 
each of the two marker loci, and test if the four genotypes can be fitted by the 
Mendelian ratio of 1:1:1:1. 

(2) Use equations 7.2 and 7.3 to estimate the female and male recombination fre- 
quencies between the two loci, respectively. 

(3) Use equation 7.4 to estimate the combined recombination frequency, and 
identify the linkage phases of the two loci in both parents. 


7.4 In exercise 7.3, suppose M2 belongs to Category II, i.e., genotypes AC and BC at
M2 are not identifiable, which are denoted by XC; genotypes AD and BD are not
identifiable either, which are denoted by XD. Work out the observed sample sizes of
the eight identifiable genotypes, and estimate the female recombination frequency
between the two loci.


7.5 In exercise 7.3, suppose M2 belongs to Category III, i.e., genotypes AC and AD
at M2 are not identifiable, which are denoted by AX; genotypes BC and BD are not
identifiable either, which are denoted by BX. Work out the observed sample sizes of
the eight identifiable genotypes, and estimate the male recombination frequency
between the two loci.


7.6 In exercise 7.3, suppose M2 belongs to Category IV, i.e., alleles A and C at M2
are not identifiable, and alleles B and D are not identifiable either. The three
identifiable genotypes at M2 are denoted by AA, AB, and BB. Work out the
observed sample sizes of the 12 identifiable genotypes, estimate the female and male
recombination frequencies between the two loci using half of the population, and
identify the linkage phases at the two loci in both parents.


7.7 In exercise 7.3, suppose M1 belongs to Category IV, i.e., alleles A and C at M1
are not identifiable, and alleles B and D are not identifiable either; M2 belongs to
Category V, i.e., alleles A and D at M2 are not identifiable, and alleles B and C are
not identifiable either. Work out the observed sample sizes of the nine identifiable
genotypes at the two loci.


7.8 In one hybrid F1 population derived from two heterozygous parents with size
200, the following table gives the observed sample sizes of 16 identifiable pairwise
genotypes for four complete markers, i.e., M1, M2, M3, and M4.


Number  Genotype  M1-M2  M1-M3  M1-M4  M2-M3  M2-M4  M3-M4

1       AC-AC     0      1      2      3      13     2
2       AC-AD     0      1      5      17     6      15
3       AD-AC     4      2      5      15     3      18
4       AD-AD     1      5      3      2      10     3
5       AC-BC     5      16     8      0      5      1
6       AC-BD     18     5      8      6      2      2
7       AD-BC     23     3      12     6      6      1
8       AD-BD     1      19     9      0      4      1
9       BC-AC     3      11     4      1      5      0
10      BC-AD     18     4      7      4      1      4
11      BD-AC     19     6      11     1      1      2
12      BD-AD     4      13     7      0      5      0
13      BC-BC     0      7      6      3      13     12
14      BC-BD     1      0      5      22     11     11
15      BD-BC     2      1      5      18     7      17
16      BD-BD     1      6      3      2      8      11


(1) Estimate the female, male, and combined recombination frequencies between 
each pair of markers, and identify the linkage phases at the two loci in parents. 

(2) From the estimates of combined recombination frequencies, work out the 
marker order by the following method. Firstly, pick up the two markers with the 
smallest recombination frequency, one as the first marker, and the other one as 
the last marker. Then from the remaining markers, pick up the one having the 
smallest recombination frequency either with the first marker or with the last 
marker. If the marker has the smallest recombination frequency with the first 
marker, it is updated as the first marker. If the marker has the smallest 
recombination frequency with the last marker, it is updated as the last marker. 

(3) Based on the marker order determined in (2), use the Haldane mapping func- 
tion to calculate the map length between neighboring markers. 

(4) Based on the identified linkage phases and marker order in (1) and (2), rebuild 
the four haploid types of two heterozygous parents. 


7.9 Assuming that there is a QTL controlling one phenotypic trait in a double cross
F1 population, three genetic effects a, b, and d at the QTL are equal to 5, -2, and -1,
respectively. The population mean is equal to 50, and the random error variance is
equal to 10.


(1) Calculate phenotypic means of the four QTL genotypes.

(2) Calculate genetic variance and the broad-sense heritability in the double cross
F1 population.

(3) For one complete marker which is linked with the QTL with recombination
frequency 0.2, calculate the phenotypic means of the four marker type classes.


7.10 From the theoretical frequencies given in table 7.10 corresponding to linkage
phases II and III, calculate the maximum likelihood estimator of recombination
frequency. Note: firstly, define a function of recombination frequency, i.e.,
$t = r - r^2$. Denote the observed sample sizes of the nine identifiable genotypes as
$n_1$–$n_9$, and the total population size as $n$. The logarithm likelihood function is
represented as follows,


$$\ln L(r) \propto (n_1 + n_3 + n_5 + n_7 + n_9) \ln t + (n_2 + n_4 + n_6 + n_8) \ln(1 - 2t)$$


Calculate the derivative of the above function with respect to $t$, and let the derivative be
equal to 0. The maximum likelihood estimator of $t$ can be acquired as
$\hat{t} = \dfrac{n_1 + n_3 + n_5 + n_7 + n_9}{2n}$. Finally, the estimate of $r$ can be obtained from the
estimator of $t$.


Chapter 8 


Genetic Analysis in Multi-Parental 
Pure-Line Progeny Populations 


In the past ten years, mating designs using multiple parents have become more and
more common in genetic studies. Compared with the bi-parental populations introduced
in chapters 2-6, multi-parental populations contain much larger genetic variation
in the phenotypic trait, allowing more QTLs and more alleles at each QTL to be
detected. As in bi-parental populations, the issue of population structure does
not exist, and the false positives caused by population structure can be avoided.
Depending on the number of parents, there are many ways to conduct a
multi-parental mating design. For example, when there are four parents, the double cross
(or four-way cross) is commonly used. When there are eight parents, an eight-way
cross is commonly used, which is also called multi-parent advanced generation
inter-crossing (MAGIC). Sometimes, MAGIC can also refer to pure-line populations
derived from a number of parents different from eight. MAGIC populations
have been reported in Arabidopsis thaliana (Kover et al., 2009), wheat (Mackay et al.,
2014; Verbyla et al., 2014; Huang et al., 2012), rice (Qu et al., 2020; Bandillo et al.,
2013), barley (Sannemann et al., 2015), soybean (Li et al., 2020), cowpea
(Huynh et al., 2018), and so on. Four-way and eight-way crosses will be taken as
examples in this chapter to introduce the linkage analysis and QTL mapping
methods in multi-parental pure-line populations. For the mating design using
four inbred parents, double cross and four-way cross have the same meaning and
are used interchangeably. For the sections on linkage analysis, linkage map construction,
and analysis software, the readers can also refer to Zhang et al. (2019). For the
sections on QTL mapping, the readers can also refer to Zhang et al. (2017) and
Shi et al. (2019).


8.1 Linkage Analysis in Four-Parental Pure-Line
Populations


8.1.1 Development Procedure and Marker Classification 
in Four-Parental Pure-Line Populations 


Genetic analysis methods for the double cross F1 population derived from four
inbred parents have been introduced in detail in §7.3 and §7.4 in chapter 7. In
the double cross F1 population, the genotype of each individual is unique and
heterozygous. Genotypes are different among the individuals in the population
and will change after selfing propagation. Except under clonal propagation,
it is impossible to conduct multi-environmental and replicated trials for
phenotyping in double cross F1 populations. However, if the doubled haploid
technology or repeated selfing is applied to the double cross F1 progenies,
pure lines will be generated, similar to those in bi-parental DH or RIL
populations. Multi-parental pure-line populations can be repeatedly grown in
multiple years and locations to increase the accuracy in phenotyping and the
power in QTL detection. In addition, QTL-by-environment interaction analysis
can be performed as well to identify the QTLs with stable genetic effects across
environments.

Figure 8.1 shows the procedure to develop the DH and RIL populations from
four inbred parents. Two single crosses are first made from the four parents. Next,
the four-way cross F1 population is generated by crossing the two single crosses.
Finally, the DH pure lines are produced by the pollen culture technology, or the
RILs are produced by repeated selfing. As a population, each inbred parent
contains only one homozygous genotype, which is called homogeneous and
homozygous (figure 8.1). The single cross F1, as a population, contains only one
heterozygous genotype, which is called homogeneous and heterozygous. In the
double cross F1 progenies, different individuals have different heterozygous
genotypes, resulting in a heterogeneous and heterozygous population (figure 8.1). In the DH
or RIL pure-line population derived from the double cross F1 by haploid doubling or
repeated selfing, different lines have different homozygous genotypes, and the
population is called heterogeneous and homozygous (figure 8.1).

In one progeny of the four inbred parents, the alleles at one locus
can be traced back to unique parents only when four identifiable alleles are present
at the locus, no matter whether the progeny is an individual in the double cross F1
population as introduced in the previous chapter, or one pure line as introduced
in this chapter. Markers having such properties are called fully informative, or
complete markers. However, when the number of identifiable alleles is smaller than
four, some genotypes cannot be separated. Under this situation, an allele in the
progeny cannot always be traced back to a unique parent, and the number of identifiable
homozygous genotypes is smaller than four as well. At one complete locus, the four alleles are
denoted by A, B, C, and D. The pure-line progenies have four homozygous
genotypes, denoted by AA, BB, CC, and DD, and have no heterozygous genotypes.
When no distortion occurs, the four homozygous genotypes follow the Mendelian
ratio of 1:1:1:1.

[Figure: Inbred A × Inbred B and Inbred C × Inbred D (homogeneous and homozygous populations) → Single cross AB × Single cross CD (homogeneous and heterozygous population) → Double cross F1 progenies (heterogeneous and heterozygous population) → DH population derived from the double cross F1 by haploid doubling, or RIL population derived from the double cross F1 by repeated selfing and single seed descent (heterogeneous and homozygous populations).]

FIG. 8.1 – Schematic representation of the development of DH and RIL populations from four
inbred parents.

Taking the integrated software GAPL (Zhang et al., 2019) as an example,
table 8.1 gives the names and genetic properties of 14 marker categories. Among these
categories, only markers of Category 1 are completely informative. They are named
ABCD and are also called complete markers. Markers in the other categories are
incompletely informative, and are also called incomplete markers. For Categories
2-7, alleles in two inbred parents cannot be separated, and the mixture genotype
accounts for 50% of the progenies. Each genotype of the other two parental types
accounts for 25% of the progenies (table 8.1). For Categories 8-10, alleles in two
inbred parents cannot be separated, and alleles in the other two parents cannot be
separated, either. There are only two identifiable genotypes in progenies, each
accounting for 50% of the progenies (table 8.1). For Categories 11-14, alleles in three
inbred parents cannot be separated, and the mixture genotype accounts for 75% of
the progenies. The genotype of the other parental type accounts for 25% of the
progenies (table 8.1).

When using the software GAPL to conduct genetic studies in four-parental 
pure-line populations, the users have to specify the category of each marker. 
Genotypes in progenies are coded by letters from A to D, and the missing genotypes 
are coded as X. For each of the 14 categories as given in table 8.1, only the capital 
letters A, B, C, D, and X are the valid values on progeny genotypes. For example, 
for one marker belonging to Category 11 or ABBB (table 8.1), genotypes of pure 
lines in the population can be represented by any letter from A to D. The software 
will treat letters B, C, and D in the progenies as the same one identifiable genotype 
automatically, based on the specified category. The authors believe the marker 
classification and flexible genotypic coding implemented in the GAPL software are 
easier to be understood and to be accepted by the readers and users. 

TAB. 8.1 — Categories of polymorphic markers in the four-parental pure-line populations.

Category  Name  Relationship on   Identifiable  Identifiable        Mendelian
                parental alleles  alleles       genotypes           segregation ratio
1         ABCD  Fully separated   4             AA, BB, CC, DD      1:1:1:1
2         AACD  A=B               3             AA + BB, CC, DD     2:1:1
3         ABCC  C=D               3             AA, BB, CC + DD     1:1:2
4         ABAD  A=C               3             AA + CC, BB, DD     2:1:1
5         ABCA  A=D               3             AA + DD, BB, CC     2:1:1
6         ABBD  B=C               3             AA, BB + CC, DD     1:2:1
7         ABCB  B=D               3             AA, BB + DD, CC     1:2:1
8         AACC  A=B, C=D          2             AA + BB, CC + DD    1:1
9         ABAB  A=C, B=D          2             AA + CC, BB + DD    1:1
10        ABBA  A=D, B=C          2             AA + DD, BB + CC    1:1
11        ABBB  B=C=D             2             AA, BB + CC + DD    1:3
12        ABAA  A=C=D             2             AA + CC + DD, BB    3:1
13        AACA  A=B=D             2             AA + BB + DD, CC    3:1
14        AAAD  A=B=C             2             AA + BB + CC, DD    3:1
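
The relationship between a category name and its identifiable genotypes can be illustrated with a short Python sketch. It is only an illustration of the coding idea in table 8.1, not the GAPL implementation; the function names `identifiable_genotypes` and `recode_progeny` are hypothetical, and the sketch assumes that the i-th letter of a category name is the allele carried by the i-th parent (A, B, C, D).

```python
def identifiable_genotypes(category):
    """Group the four parents (A, B, C, D) by the allele letter they carry.

    `category` is a four-letter name from table 8.1, e.g. "ABBB".
    Returns a list of (parents sharing one allele, expected share out of 4).
    """
    groups = {}  # allele letter -> parents carrying it (insertion ordered)
    for parent, allele in zip("ABCD", category):
        groups.setdefault(allele, []).append(parent)
    return [("".join(members), len(members)) for members in groups.values()]


def recode_progeny(category, genotype):
    """Map a progeny code (A-D, or X for missing) to its identifiable class."""
    if genotype == "X":
        return "X"  # missing genotypes stay missing
    for members, _ in identifiable_genotypes(category):
        if genotype in members:
            return members
    raise ValueError(f"invalid genotype code: {genotype!r}")


# Category ABBB: parents B, C, and D share one allele, so the ratio is 1:3.
print(identifiable_genotypes("ABBB"))
print(recode_progeny("ABBB", "D"))
```

Under this reading, any of the letters B, C, or D in the progeny of an ABBB marker falls into the same class, which is the behavior described in the text.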


8.1.2 Theoretical Frequencies of Genotypes and Estimation 
of Recombination Frequency at Two Complete Loci 


Consider two complete marker loci first, and let A1, B1, C1, and D1 denote the
four alleles at locus 1, and A2, B2, C2, and D2 denote the four alleles at locus 2. There
are 16 homozygous genotypes in the pure-line progenies when the two loci are
considered jointly. When the two loci are not linked, the 16 genotypes have an equal
theoretical frequency. When considering linkage, the theoretical frequencies of the 16
genotypes depend on the recombination frequency between the two loci. Figure 8.2
shows the diagram of population development and genotypes at two linked loci in
the double cross F1 and the pure lines derived from four inbred parents.

Obviously, in the double cross F1 population, one homologous chromosome in an
individual progeny is generated by recombination and crossover between the
chromosomes of parents A and B, and the other homologous chromosome is generated
by recombination and crossover between the chromosomes of parents C
and D. There is no chance for recombination and crossover to happen between the
chromosome of parent A or B and the chromosome of parent C or D. The female
recombination frequency, as introduced in chapter 7, measures the probability
of crossing-over between two loci on the chromosomes of parents A and B; the male
recombination frequency measures the probability of crossing-over between two loci
on the chromosomes of parents C and D; and the combined recombination frequency
can be treated as the average of the female and male recombination frequencies.
However, only the combined recombination frequency can be estimated in the
pure-line populations, and we will come back to this issue later.

[Figure 8.2: diagram. The inbred parents A, B, C, and D carry alleles A1 and A2, B1 and B2, C1 and C2, and D1 and D2 at the two loci. Single cross AB (A1A2/B1B2) × single cross CD (C1C2/D1D2) gives the double cross F1 progenies, with the 16 genotypes A1A2/C1C2, A1A2/C1D2, A1A2/D1C2, A1A2/D1D2, A1B2/C1C2, A1B2/C1D2, A1B2/D1C2, A1B2/D1D2, B1A2/C1C2, B1A2/C1D2, B1A2/D1C2, B1A2/D1D2, B1B2/C1C2, B1B2/C1D2, B1B2/D1C2, and B1B2/D1D2. Doubled haploids or repeated selfing then give the pure lines derived from the double cross F1, with the 16 homozygous genotypes from A1A1A2A2 to D1D1D2D2.]

FIG. 8.2 — Schematic representation of genotypes at two linked loci in pure-line progenies
derived from four inbred parents.


The concept of the generation transition matrix introduced in chapter 2
is used again to calculate the theoretical genotypic frequencies in both DH
lines and RILs derived from the double cross F1 population. To do this, we first
need the genotypic frequencies in the double cross F1 population and the
transition matrix from the double cross F1 to pure-line progenies. In the double cross
F1 population, each individual is heterozygous (figure 8.2). During the meiosis that
produces gametes, recombination and crossover can occur between any pair of the
four parental chromosomes. Therefore, it is no longer possible to separate the
recombination frequencies in single crosses AB and CD in the derived DH and RIL
populations. Only the combined recombination frequency can be estimated. The
one-meiosis recombination frequency between the two loci is denoted as r. Based on
the frequencies of the four gametes generated by each of single crosses AB and CD, it
is easy to acquire the theoretical frequencies of the 16 genotypes in the double cross F1
population, which are given in table 8.2.

TAB. 8.2 — Theoretical frequencies of the 16 genotypes in the double cross F1 population.

Gametes generated by   Gametes generated by single cross CD and the frequencies
single cross AB and    C1C2, (1-r)/2   C1D2, r/2   D1C2, r/2   D1D2, (1-r)/2
the frequencies
A1A2, (1-r)/2          (1-r)^2/4       r(1-r)/4    r(1-r)/4    (1-r)^2/4
A1B2, r/2              r(1-r)/4        r^2/4       r^2/4       r(1-r)/4
B1A2, r/2              r(1-r)/4        r^2/4       r^2/4       r(1-r)/4
B1B2, (1-r)/2          (1-r)^2/4       r(1-r)/4    r(1-r)/4    (1-r)^2/4

Table 8.3 gives the generation transition matrix from the double cross F1 to DH
pure-line progenies. Each row of the matrix gives the genotypic frequencies of DH
lines derived from one genotype in the double cross F1 population. The sum of the
elements in each row is equal to 1. The blank element means the corresponding
genotype does not exist, or its frequency can be viewed as 0. In the transition matrix
given in table 8.3, it can be seen that each row has two non-zero elements equal to
$\frac{1}{2}(1 - r)$ and two non-zero elements equal to $\frac{1}{2}r$, and the
remaining elements are all equal to 0. This can be easily understood from the genotypes
included in the double cross F1 population. Taking genotype A1A2/C1C2 in the first row
as an example, four gametes will be generated by individuals with genotype A1A2/C1C2
during meiosis. Among them, A1A2 and C1C2 are the two parental types, each with the
frequency $\frac{1}{2}(1 - r)$, and A1C2 and C1A2 are the two recombinant types, each
with the frequency $\frac{1}{2}r$. Thus, in the DH lines derived by haploid doubling
from individuals with genotype A1A2/C1C2, the frequencies of the two homozygous
genotypes A1A1A2A2 and C1C1C2C2 are both equal to $\frac{1}{2}(1 - r)$, and the
frequencies of the two homozygous genotypes A1A1C2C2 and C1C1A2A2 are both equal to
$\frac{1}{2}r$. Other homozygous genotypes are not present and therefore their
frequencies are equal to 0.

In table 8.3, the sum of the products between the frequencies in the column
corresponding to one of the 16 pure-line genotypes and the genotypic frequencies in the
double cross F1 population gives the theoretical frequency of that pure-line
genotype. Theoretical frequencies of the 16 pure-line genotypes in the DH population
are given in table 8.4. For example, the sum of the products between the frequencies
in the first column, corresponding to the progeny genotype A1A1A2A2, and the genotypic
frequencies in the double cross F1 population is given by

$$\tfrac{1}{4}(1-r)^2 \cdot \tfrac{1}{2}(1-r) + \tfrac{1}{4}r(1-r) \cdot \tfrac{1}{2}(1-r) + \tfrac{1}{4}r(1-r) \cdot \tfrac{1}{2}(1-r) + \tfrac{1}{4}(1-r)^2 \cdot \tfrac{1}{2}(1-r) = \tfrac{1}{4}(1-r)^2,$$

which is the theoretical frequency of homozygous genotype A1A1A2A2 (i.e., A1A1 at
locus 1 and A2A2 at locus 2), i.e., $\frac{1}{4}(1 - r)^2$ as shown in the first row
and first column in table 8.4.
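
The column-sum calculation described above can be checked numerically. The following Python sketch rebuilds the double cross F1 frequencies and the rows of the DH transition matrix from the gamete frequencies and then sums the products. The function names and the representation of a genotype as a pair (allele origin at locus 1, allele origin at locus 2) are assumptions made for illustration; they do not come from the book or from GAPL.

```python
from itertools import product


def f1_frequencies(r):
    """Frequencies of the 16 double cross F1 genotypes.

    A genotype is a pair of gametes, one from single cross AB and one from
    single cross CD. A gamete is written (origin at locus 1, origin at locus 2).
    """
    def gametes(p, q):
        # Parental gametes have frequency (1 - r)/2, recombinants r/2.
        return {(p, p): (1 - r) / 2, (p, q): r / 2,
                (q, p): r / 2, (q, q): (1 - r) / 2}

    ab, cd = gametes("A", "B"), gametes("C", "D")
    return {(g1, g2): f1 * f2
            for (g1, f1), (g2, f2) in product(ab.items(), cd.items())}


def dh_transition_row(genotype, r):
    """DH lines produced by one F1 genotype: two parental, two recombinant."""
    (x1, y2), (z1, w2) = genotype
    return {(x1, y2): (1 - r) / 2, (z1, w2): (1 - r) / 2,
            (x1, w2): r / 2, (z1, y2): r / 2}


def dh_frequencies(r):
    """Sum of F1 frequency times transition probability per DH genotype."""
    out = {}
    for genotype, freq in f1_frequencies(r).items():
        for line, t in dh_transition_row(genotype, r).items():
            out[line] = out.get(line, 0.0) + freq * t
    return out


freqs = dh_frequencies(0.2)
print(freqs[("A", "A")])  # compare with (1 - r)^2 / 4
print(freqs[("A", "B")])  # compare with r(1 - r) / 4
print(freqs[("A", "C")])  # compare with r / 8
```

With r = 0.2, the three printed values can be compared against the closed forms $\frac{1}{4}(1-r)^2$, $\frac{1}{4}r(1-r)$, and $\frac{1}{8}r$.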

In table 8.3, if the one-meiosis recombination frequency r is replaced by the
accumulated recombination frequency during the repeated selfing, i.e., R, the
transition matrix from the double cross F1 to the RIL population is obtained, which will
not be given separately. Equation 8.1 shows the relationship between the two
recombination frequencies, which is actually the same as equation 2.7 in chapter 2.

$$R = \frac{2r}{1 + 2r} \quad \text{or} \quad r = \frac{R}{2(1 - R)} \qquad (8.1)$$

TAB. 8.3 — Generation transition matrix from the double cross F1 to DH pure-line progenies.

Genotype in the  Frequency   Genotype in the DH pure-line progeny population
double cross F1              A1A1A2A2  A1A1B2B2  A1A1C2C2  A1A1D2D2  B1B1A2A2  B1B1B2B2  B1B1C2C2  B1B1D2D2
A1A2/C1C2        (1-r)^2/4   (1-r)/2   .         r/2       .         .         .         .         .
A1A2/C1D2        r(1-r)/4    (1-r)/2   .         .         r/2       .         .         .         .
A1A2/D1C2        r(1-r)/4    (1-r)/2   .         r/2       .         .         .         .         .
A1A2/D1D2        (1-r)^2/4   (1-r)/2   .         .         r/2       .         .         .         .
A1B2/C1C2        r(1-r)/4    .         (1-r)/2   r/2       .         .         .         .         .
A1B2/C1D2        r^2/4       .         (1-r)/2   .         r/2       .         .         .         .
A1B2/D1C2        r^2/4       .         (1-r)/2   r/2       .         .         .         .         .
A1B2/D1D2        r(1-r)/4    .         (1-r)/2   .         r/2       .         .         .         .
B1A2/C1C2        r(1-r)/4    .         .         .         .         (1-r)/2   .         r/2       .
B1A2/C1D2        r^2/4       .         .         .         .         (1-r)/2   .         .         r/2
B1A2/D1C2        r^2/4       .         .         .         .         (1-r)/2   .         r/2       .
B1A2/D1D2        r(1-r)/4    .         .         .         .         (1-r)/2   .         .         r/2
B1B2/C1C2        (1-r)^2/4   .         .         .         .         .         (1-r)/2   r/2       .
B1B2/C1D2        r(1-r)/4    .         .         .         .         .         (1-r)/2   .         r/2
B1B2/D1C2        r(1-r)/4    .         .         .         .         .         (1-r)/2   r/2       .
B1B2/D1D2        (1-r)^2/4   .         .         .         .         .         (1-r)/2   .         r/2

TAB. 8.3 — (continued).

Genotype in the  Frequency   Genotype in the DH pure-line progeny population
double cross F1              C1C1A2A2  C1C1B2B2  C1C1C2C2  C1C1D2D2  D1D1A2A2  D1D1B2B2  D1D1C2C2  D1D1D2D2
A1A2/C1C2        (1-r)^2/4   r/2       .         (1-r)/2   .         .         .         .         .
A1A2/C1D2        r(1-r)/4    r/2       .         .         (1-r)/2   .         .         .         .
A1A2/D1C2        r(1-r)/4    .         .         .         .         r/2       .         (1-r)/2   .
A1A2/D1D2        (1-r)^2/4   .         .         .         .         r/2       .         .         (1-r)/2
A1B2/C1C2        r(1-r)/4    .         r/2       (1-r)/2   .         .         .         .         .
A1B2/C1D2        r^2/4       .         r/2       .         (1-r)/2   .         .         .         .
A1B2/D1C2        r^2/4       .         .         .         .         .         r/2       (1-r)/2   .
A1B2/D1D2        r(1-r)/4    .         .         .         .         .         r/2       .         (1-r)/2
B1A2/C1C2        r(1-r)/4    r/2       .         (1-r)/2   .         .         .         .         .
B1A2/C1D2        r^2/4       r/2       .         .         (1-r)/2   .         .         .         .
B1A2/D1C2        r^2/4       .         .         .         .         r/2       .         (1-r)/2   .
B1A2/D1D2        r(1-r)/4    .         .         .         .         r/2       .         .         (1-r)/2
B1B2/C1C2        (1-r)^2/4   .         r/2       (1-r)/2   .         .         .         .         .
B1B2/C1D2        r(1-r)/4    .         r/2       .         (1-r)/2   .         .         .         .
B1B2/D1C2        r(1-r)/4    .         .         .         .         .         r/2       (1-r)/2   .
B1B2/D1D2        (1-r)^2/4   .         .         .         .         .         r/2       .         (1-r)/2

Note: A1, B1, C1, and D1 are the four alleles at locus 1, and A2, B2, C2, and D2 are the four alleles at locus 2. The one-meiosis recombination
frequency between the two loci is denoted as r. A dot marks a blank element, i.e., a frequency of 0. If r in the transition elements is replaced
with R as defined in equation 8.1, the transition matrix is the one from the double cross F1 to RIL pure-line progenies.

TAB. 8.4 — Theoretical frequencies of the 16 genotypes at two linked loci in the four-parental
DH progeny population.

Genotype at   Genotype at locus 2
locus 1       A2A2             B2B2             C2C2             D2D2
A1A1          (1-r)^2/4 (1)    r(1-r)/4 (2)     r/8 (3)          r/8 (4)
B1B1          r(1-r)/4 (5)     (1-r)^2/4 (6)    r/8 (7)          r/8 (8)
C1C1          r/8 (9)          r/8 (10)         (1-r)^2/4 (11)   r(1-r)/4 (12)
D1D1          r/8 (13)         r/8 (14)         r(1-r)/4 (15)    (1-r)^2/4 (16)

Note: the number in brackets (i.e., 1-16) behind the theoretical frequency represents the order of
the corresponding genotype.


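Using the transition matrix from the double cross F1 to the RIL population,
theoretical frequencies of genotypes in the RIL population can be calculated, which
are given in table 8.5. Again taking the progeny genotype A1A1A2A2 as an example,
its theoretical frequency is calculated by

$$\tfrac{1}{4}(1-r)^2 \cdot \tfrac{1}{2}(1-R) + \tfrac{1}{4}r(1-r) \cdot \tfrac{1}{2}(1-R) + \tfrac{1}{4}r(1-r) \cdot \tfrac{1}{2}(1-R) + \tfrac{1}{4}(1-r)^2 \cdot \tfrac{1}{2}(1-R) = \tfrac{1}{4}(1-r)(1-R).$$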


TAB. 8.5 — Theoretical frequencies of the 16 genotypes at two linked loci in the four-parental
RIL progeny population.

Genotype at   Genotype at locus 2
locus 1       A2A2                 B2B2                 C2C2                 D2D2
A1A1          1/4 - 3R/8           R/8                  R/8                  R/8
              or (1-r)/[4(1+2r)]   or r/[4(1+2r)]       or r/[4(1+2r)]       or r/[4(1+2r)]
B1B1          R/8                  1/4 - 3R/8           R/8                  R/8
              or r/[4(1+2r)]       or (1-r)/[4(1+2r)]   or r/[4(1+2r)]       or r/[4(1+2r)]
C1C1          R/8                  R/8                  1/4 - 3R/8           R/8
              or r/[4(1+2r)]       or r/[4(1+2r)]       or (1-r)/[4(1+2r)]   or r/[4(1+2r)]
D1D1          R/8                  R/8                  R/8                  1/4 - 3R/8
              or r/[4(1+2r)]       or r/[4(1+2r)]       or r/[4(1+2r)]       or (1-r)/[4(1+2r)]

Note: Orders of the 16 homozygous genotypes in the RIL population are the same as those
given in table 8.4, which are not indicated in the table.
From equation 8.1, the theoretical frequency of homozygous genotype A1A1A2A2
(i.e., A1A1 at locus 1 and A2A2 at locus 2) can be further represented by
$\frac{1}{4} - \frac{3}{8}R$ or $\frac{1-r}{4(1+2r)}$, as shown in the first row and
first column in table 8.5.

Let n be the size of the four-parental pure-line population, and $n_1$ to $n_{16}$ be the
observed numbers of the 16 genotypes as given in table 8.4 or 8.5. Likelihood functions
are given in equations 8.2 and 8.3 for DH and RIL populations, respectively.


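$$L \propto \left[(1-r)^2\right]^{n_1+n_6+n_{11}+n_{16}} \left[r(1-r)\right]^{n_2+n_5+n_{12}+n_{15}} \, r^{\,n_3+n_4+n_7+n_8+n_9+n_{10}+n_{13}+n_{14}} \quad \text{(DH population)} \qquad (8.2)$$

$$L \propto \left(\tfrac{1}{4} - \tfrac{3}{8}R\right)^{n_1+n_6+n_{11}+n_{16}} \left(\tfrac{1}{8}R\right)^{n-(n_1+n_6+n_{11}+n_{16})} \quad \text{(RIL population)} \qquad (8.3)$$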


Calculate the log-likelihoods, i.e., $\ln L$, from equations 8.2 and 8.3, and the
derivatives of $\ln L$ with respect to r, i.e., $\frac{d \ln L}{dr}$. Solving the
likelihood equation $\frac{d \ln L}{dr} = 0$ gives the maximum likelihood estimate of
the combined recombination frequency, which is given in equations 8.4 and 8.5 for the
two kinds of pure-line populations, respectively. In the RIL population, the estimator
of the accumulated recombination frequency R is acquired first, and then the
one-meiosis recombination frequency r is acquired by using equation 8.1.


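$$\hat{r} = \frac{n - (n_1+n_6+n_{11}+n_{16})}{n + n_1+n_2+n_5+n_6+n_{11}+n_{12}+n_{15}+n_{16}} \quad \text{(DH population)} \qquad (8.4)$$

$$\hat{R} = \frac{2\left[n - (n_1+n_6+n_{11}+n_{16})\right]}{3n}, \quad \hat{r} = \frac{n - (n_1+n_6+n_{11}+n_{16})}{n + 2(n_1+n_6+n_{11}+n_{16})} \quad \text{(RIL population)} \qquad (8.5)$$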
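
The two estimators can be written compactly in code. The sketch below is a minimal illustration, assuming the observed numbers are supplied as a dictionary keyed by the genotype orders 1 to 16 of table 8.4; the function and constant names are hypothetical. The estimators coded here are the ones obtained by maximizing likelihoods built from the frequencies in tables 8.4 and 8.5.

```python
DIAGONAL = (1, 6, 11, 16)      # both loci carry alleles of the same parent
WITHIN_CROSS = (2, 5, 12, 15)  # A with B, or C with D


def estimate_r_dh(counts):
    """Maximum likelihood estimate of r in a four-parental DH population."""
    n = sum(counts.values())
    n_diag = sum(counts[k] for k in DIAGONAL)
    n_within = sum(counts[k] for k in WITHIN_CROSS)
    return (n - n_diag) / (n + n_diag + n_within)


def estimate_r_ril(counts):
    """Return (R, r) estimated in a four-parental RIL population."""
    n = sum(counts.values())
    n_diag = sum(counts[k] for k in DIAGONAL)
    R = 2 * (n - n_diag) / (3 * n)
    r = (n - n_diag) / (n + 2 * n_diag)  # same as R / (2(1 - R))
    return R, r


# Counts equal to their expectations for r = 0.2 in a DH population of 1000:
# 160 per diagonal genotype, 40 per within-cross genotype, 25 otherwise.
dh_counts = {k: 25 for k in range(1, 17)}
dh_counts.update({k: 160 for k in DIAGONAL})
dh_counts.update({k: 40 for k in WITHIN_CROSS})
print(estimate_r_dh(dh_counts))
```

Feeding in counts that equal their expectations is a convenient sanity check, because the estimator should then return the value of r used to build them.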


Given in equations 8.2 and 8.3 are the likelihood functions under the alternative
hypothesis HA: r ≠ 0.5. By substituting the estimates from equations 8.4 and 8.5
into equations 8.2 and 8.3, the maximum likelihoods under hypothesis HA can be
obtained for the DH and RIL populations, respectively. In equations 8.2 and 8.3,
letting r = 0.5, the maximum likelihoods under the null hypothesis H0: r = 0.5 (i.e.,
the two loci are not linked) can be obtained. Therefore, the LRT statistic and LOD
score can be calculated and then used to determine the significance of the linkage
relationship. The readers can refer to chapters 2 and 4 for details on the calculation
of test statistics and how the inference can be made.
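
As an illustration of the test for the RIL case, the sketch below computes the LOD score and the LRT statistic from the observed numbers. It assumes that the LOD score is the base-10 logarithm of the likelihood ratio and that LRT = 2 ln(10) × LOD, which are the usual definitions; the function names are hypothetical, and no bound such as r ≤ 0.5 is imposed on the estimate.

```python
import math

DIAGONAL = (1, 6, 11, 16)  # genotype orders 1, 6, 11, 16 of table 8.4


def lod_ril(counts):
    """LOD score for linkage between two complete loci in a RIL population.

    counts: dict {1..16: observed number of each complete genotype}.
    """
    n = sum(counts.values())
    n_diag = sum(counts[k] for k in DIAGONAL)
    n_off = n - n_diag

    def log10_likelihood(R):
        # Diagonal genotypes have frequency 1/4 - 3R/8, all others R/8.
        value = 0.0
        if n_diag:
            value += n_diag * math.log10(0.25 - 0.375 * R)
        if n_off:
            value += n_off * math.log10(R / 8)
        return value

    R_hat = 2 * n_off / (3 * n)  # estimate of R under HA
    # Under H0, r = 0.5 corresponds to R = 0.5.
    return log10_likelihood(R_hat) - log10_likelihood(0.5)


def lrt_from_lod(lod):
    """Likelihood ratio test statistic: 2 ln(L_A / L_0) = 2 ln(10) * LOD."""
    return 2 * math.log(10) * lod


linked = {k: 10 for k in range(1, 17)}
linked.update({k: 40 for k in DIAGONAL})
lod = lod_ril(linked)
print(round(lod, 2), round(lrt_from_lod(lod), 2))
```

When all 16 genotypes are observed equally often, the estimate of R equals 0.5 and the LOD score is zero, as expected for unlinked loci.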


8.1.3 Estimation of the Recombination Frequency 
Involving Incomplete Markers 


Based on the 14 marker categories shown in table 8.1, a total of 105 scenarios
(91 pairs of different categories plus 14 pairs of the same category) have to be
considered to estimate r between two markers of the 14 categories. Theoretical
frequencies of the identifiable genotypes in each scenario can be derived from
table 8.4 or 8.5 by combining the un-separated genotypes, and then used to calculate
the maximum likelihood estimate of recombination frequency. Suppose one marker
belongs to Category ABCD and the other one belongs to Category AACD.
There are four identifiable alleles at locus 1, i.e., A1, B1, C1, and D1, and three
identifiable alleles at locus 2, i.e., A2 + B2, C2, and D2. The pure-line progeny
population has 12 identifiable genotypes when the two loci are considered together,
and their theoretical frequencies are shown in table 8.6. Observed sizes of the 12
identifiable genotypes are represented by $m_1$ to $m_{12}$, and the size of the whole
progeny population is represented by n. As practice, the readers are encouraged to
calculate the maximum likelihood estimate of recombination frequency based on the
theoretical genotypic frequencies given in table 8.6 (see also exercise 8.1).


TAB. 8.6 — Theoretical frequencies of the 12 identifiable marker classes when one locus
belongs to Category ABCD and the other one belongs to Category AACD.

Marker  Identifiable genotype    Frequency in the  Frequency in the RIL            Observed
class   Locus 1  Locus 2         DH population     population                      sample size
1       A1A1     A2A2 + B2B2     (1-r)/4           (1-R)/4 or 1/[4(1+2r)]          m1
2       A1A1     C2C2            r/8               R/8 or r/[4(1+2r)]              m2
3       A1A1     D2D2            r/8               R/8 or r/[4(1+2r)]              m3
4       B1B1     A2A2 + B2B2     (1-r)/4           (1-R)/4 or 1/[4(1+2r)]          m4
5       B1B1     C2C2            r/8               R/8 or r/[4(1+2r)]              m5
6       B1B1     D2D2            r/8               R/8 or r/[4(1+2r)]              m6
7       C1C1     A2A2 + B2B2     r/4               R/4 or r/[2(1+2r)]              m7
8       C1C1     C2C2            (1-r)^2/4         1/4 - 3R/8 or (1-r)/[4(1+2r)]   m8
9       C1C1     D2D2            r(1-r)/4          R/8 or r/[4(1+2r)]              m9
10      D1D1     A2A2 + B2B2     r/4               R/4 or r/[2(1+2r)]              m10
11      D1D1     C2C2            r(1-r)/4          R/8 or r/[4(1+2r)]              m11
12      D1D1     D2D2            (1-r)^2/4         1/4 - 3R/8 or (1-r)/[4(1+2r)]   m12
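
The combination of un-separated genotypes can be automated for any pair of categories. The sketch below, written for the DH population, starts from the 16 complete genotype frequencies of table 8.4 and pools those that share the same identifiable class. It is an illustration under the assumption that the i-th letter of a category name gives the allele carried by the i-th parent; the function names are hypothetical.

```python
def complete_frequencies_dh(r):
    """Frequencies of the 16 complete genotypes in the DH population.

    Keys are (parental origin at locus 1, parental origin at locus 2).
    """
    same_cross = {"A": "B", "B": "A", "C": "D", "D": "C"}
    freqs = {}
    for p1 in "ABCD":
        for p2 in "ABCD":
            if p1 == p2:
                f = (1 - r) ** 2 / 4       # e.g. A1A1A2A2
            elif same_cross[p1] == p2:
                f = r * (1 - r) / 4        # e.g. A1A1B2B2
            else:
                f = r / 8                  # e.g. A1A1C2C2
            freqs[(p1, p2)] = f
    return freqs


def class_of(category, parent):
    """Allele letter shown by a line carrying this parent's allele."""
    return category["ABCD".index(parent)]


def identifiable_frequencies(cat1, cat2, r):
    """Pool complete genotypes that cannot be separated at the two markers."""
    out = {}
    for (p1, p2), f in complete_frequencies_dh(r).items():
        key = (class_of(cat1, p1), class_of(cat2, p2))
        out[key] = out.get(key, 0.0) + f
    return out


classes = identifiable_frequencies("ABCD", "AACD", r=0.2)
print(len(classes))           # number of identifiable marker classes
print(classes[("A", "A")])    # class 1: A1A1 with A2A2 + B2B2
print(classes[("C", "A")])    # class 7: C1C1 with A2A2 + B2B2
```

For Categories ABCD and AACD this pooling gives 12 classes, and the pooled frequencies can be compared with the DH column of table 8.6.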


However, it is tedious to derive the theoretical frequencies of identifiable
genotypes and work out a succinct solution for the recombination frequency in each of
the 105 scenarios. Many more scenarios occur when more inbred parents, such as eight,
are used in the mating design. Introduced below is the use of the EM algorithm in
recombination frequency estimation in four-parental pure-line populations.

In the EM algorithm, an initial value is first assigned to the recombination
frequency to be estimated. Then the expected sample sizes of the 16 complete marker
types are calculated from the observed sample sizes of the identifiable marker classes
and the expected frequencies of the complete marker types included in each class. The
recombination frequency is re-calculated in the same way as for two complete markers,
i.e., by equation 8.4 or 8.5, and then used as the new initial value for the next
iteration. This procedure continues until the difference in r between two succeeding
iterations reaches a pre-defined precision, for example, 0.0001 by default in the
software GAPL. The final value when the EM iteration stops is the maximum likelihood
estimate of recombination frequency. It is worth mentioning that the estimate acquired
by the EM algorithm is the same as the estimate obtained by solving the likelihood
equation. The EM algorithm itself does not provide the variance of the estimate. To
obtain the variance of the estimated recombination frequency, the first-order and
second-order derivatives of the log-likelihood function have to be calculated, and the
Fisher information introduced in chapter 2 has to be used.

Take again the two markers given in table 8.6 as an example to demonstrate the 
application of the EM algorithm in estimating the recombination frequency in 
four-parental pure-line populations. It can be easily seen that the identifiable marker 
classes 2, 3, 5, 6, 7, and 10 given in table 8.6 are the recombination types between the 
chromosomes of parents A and B and the chromosomes of parents C and D. 
Therefore, the ratio of these genotypes in the population can be used as the initial 
value of recombination frequency, i.e., equation 8.6. 

$$r_0 = \frac{m_2 + m_3 + m_5 + m_6 + m_7 + m_{10}}{n} \qquad (8.6)$$


It can be seen from table 8.6 that some marker classes correspond to single 
complete genotypes, but some classes are the combinations of two complete geno- 
types as given in table 8.4 (for DH) or table 8.5 (for RIL). Table 8.7 shows the 
relationship between the identifiable marker classes defined in table 8.6 and the 16 
complete genotypes. It can be seen that eight identifiable classes as defined in 
table 8.6, i.e., 2, 3, 5, 6, 8, 9, 11, and 12, are equivalent to complete genotypes 3, 4, 7, 
8, 11, 12, 15 and 16, respectively, as defined in table 8.4 or 8.5. Therefore, the 
observed sample sizes of the eight classes are exactly the same as those of the eight 
complete marker types, 


TAB. 8.7 — Relationship between the 12 identifiable marker classes (1-12) and 16 complete
marker types (1-16) at two markers, one of Category ABCD and one of Category AACD.

Allele at   Complete marker types,    Identifiable marker class,
locus 1     by allele at locus 2      by allele at locus 2
            A    B    C    D          A    B    C    D
A           1    2    3    4          1    1    2    3
B           5    6    7    8          4    4    5    6
C           9    10   11   12         7    7    8    9
D           13   14   15   16         10   10   11   12

$$n_3 = m_2, \; n_4 = m_3, \; n_7 = m_5, \; n_8 = m_6, \; n_{11} = m_8, \; n_{12} = m_9, \; n_{15} = m_{11}, \; n_{16} = m_{12} \qquad (8.7)$$


The other four marker classes as defined in table 8.6, i.e., 1, 4, 7, and 10, are
combinations of two genotypes as defined in table 8.4 (for DH) or table 8.5 (for RIL)
(see also table 8.7). In the DH population, the first identifiable marker class is a
combination of complete genotypes 1 and 2 in table 8.4, with the ratio of
$\frac{1}{4}(1 - r)^2 : \frac{1}{4}r(1 - r) = (1 - r) : r$. So the observed sample size
$m_1$ is split into the sample sizes of two complete genotypes, i.e., $n_1$ and $n_2$,
by the ratio $(1 - r_0) : r_0$. The fourth identifiable marker class is the combination
of complete genotypes 5 and 6 in table 8.4, with the ratio of $r : (1 - r)$. So the
observed sample size $m_4$ is split into the sample sizes of two complete genotypes,
i.e., $n_5$ and $n_6$, by the ratio $r_0 : (1 - r_0)$. For marker class 7 or 10, the
ratio of the two included complete genotypes is 1:1. So $m_7$ is split into $n_9$ and
$n_{10}$ by a ratio of 1:1, and $m_{10}$ is split into $n_{13}$ and $n_{14}$ by a ratio
of 1:1. In the RIL population, the four marker classes 1, 4, 7, and 10 have the same
ratios of the included complete genotypes as those in the DH population, which is left
for the readers to confirm. Therefore, sample sizes of the other eight complete
genotypes can be estimated by equation 8.8 for both DH and RIL populations.


$$n_1 = (1 - r_0)m_1, \; n_2 = r_0 m_1, \; n_5 = r_0 m_4, \; n_6 = (1 - r_0)m_4, \; n_9 = 0.5m_7, \; n_{10} = 0.5m_7, \; n_{13} = 0.5m_{10}, \; n_{14} = 0.5m_{10} \qquad (8.8)$$


As described above, equations 8.7 and 8.8 represent the E-step in the EM algorithm,
i.e., set an initial value for the recombination frequency, calculate the expected ratio
of complete genotypes included in incomplete marker classes, and convert the
incomplete data into complete data by the expected ratio of complete genotypes.
When the sample sizes of the 16 complete genotypes are acquired from equations 8.7 and
8.8, the recombination frequency is re-estimated by equation 8.4 or 8.5 in the same way
as for complete data, and then used as the new initial value. The procedure to
re-calculate the recombination frequency according to the method of two complete
markers is the M-step in the EM algorithm. Table 8.8 gives the first five iterations of
the EM algorithm in the estimation of recombination frequency for two markers of
Categories ABCD and AACD in a four-parental RIL population of size 200. The first two
columns show the sample sizes of the 12 identifiable marker classes at the two loci,
where the order of marker classes is the same as given in table 8.6. The initial value
is equal to 0.29, calculated by equation 8.6. The second part of table 8.8 shows the
sample sizes of the 16 complete genotypes in the first five iterations (E-step). Given
at the bottom of the table are the estimates of recombination frequency calculated by
equation 8.5 (M-step).

TAB. 8.8 — Estimation of recombination frequency by the EM algorithm for two markers of
Categories ABCD and AACD.

Identifiable   Observed      Complete       Iteration in the EM algorithm
marker class   sample size   marker class   1        2        3        4        5
1              26            1              18.46    20.54    21.00    21.10    21.13
2              7             2              7.54     5.46     5.00     4.90     4.87
3              4             3              7        7        7        7        7
4              42            4              4        4        4        4        4
5              5             5              12.18    8.82     8.07     7.91     7.87
6              6             6              29.82    33.18    33.93    34.09    34.13
7              21            7              5        5        5        5        5
8              30            8              6        6        6        6        6
9              6             9              10.5     10.5     10.5     10.5     10.5
10             15            10             10.5     10.5     10.5     10.5     10.5
11             5             11             30       30       30       30       30
12             33            12             6        6        6        6        6
                             13             7.5      7.5      7.5      7.5      7.5
                             14             7.5      7.5      7.5      7.5      7.5
                             15             5        5        5        5        5
                             16             33       33       33       33       33
Estimated r                  0.2900         0.2100   0.1921   0.1883   0.1875   0.1873
                             (initial)

It can be seen from table 8.8 that after five iterations, the estimated value
tends to be stable, and so do the expected sample sizes of the 16 complete genotypes.
The readers may want to confirm that for the observed numbers of the 12 identifiable
classes given in table 8.8, the EM algorithm started from any initial value between 0
and 1 converges to about 0.1873 in just a few iterations. Therefore, for the two marker
categories in table 8.8, the EM algorithm converges to the estimated
recombination frequency rather quickly for a wide range of initial values. In reality,
the convergence speed of the EM algorithm depends on the categories of incomplete markers.
For example, for the two categories ABCC and AAAD of markers in exercise 8.4, it
will take tens of iterations to converge. Generally speaking, the more incomplete
the information carried by the two markers, i.e., the fewer identifiable alleles there
are in the parents and the fewer identifiable genotypes there are in the progeny
population, the slower the convergence of the EM algorithm will be. Nevertheless, the EM
algorithm is always effective in estimating the recombination frequency when
incomplete markers are involved in the four-parental pure-line populations.
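
The E-step and M-step described in this section can be put together in a few lines of code. The sketch below is a minimal illustration for two markers of Categories ABCD and AACD in a RIL population, using the observed sample sizes of table 8.8. It is not the GAPL implementation: the function names, the stopping rule, and the `max_iter` safeguard are assumptions made for the example.

```python
def e_step(m, r):
    """Expected sizes of the 16 complete genotypes (keys 1..16).

    m: the 12 observed sizes, in the order of the identifiable marker classes.
    """
    m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, m12 = m
    return {
        1: (1 - r) * m1, 2: r * m1, 3: m2, 4: m3,
        5: r * m4, 6: (1 - r) * m4, 7: m5, 8: m6,
        9: 0.5 * m7, 10: 0.5 * m7, 11: m8, 12: m9,
        13: 0.5 * m10, 14: 0.5 * m10, 15: m11, 16: m12,
    }


def m_step_ril(counts):
    """Re-estimate r from complete data in a RIL population."""
    n = sum(counts.values())
    n_diag = counts[1] + counts[6] + counts[11] + counts[16]
    return (n - n_diag) / (n + 2 * n_diag)


def em_abcd_aacd_ril(m, r0=None, tol=1e-4, max_iter=100):
    """Return the sequence of estimates, starting with the initial value."""
    m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, m12 = m
    if r0 is None:
        # Initial value: proportion of the classes 2, 3, 5, 6, 7, and 10.
        r0 = (m2 + m3 + m5 + m6 + m7 + m10) / sum(m)
    history = [r0]
    for _ in range(max_iter):
        r_new = m_step_ril(e_step(m, history[-1]))
        history.append(r_new)
        if abs(history[-1] - history[-2]) < tol:
            break
    return history


observed = [26, 7, 4, 42, 5, 6, 21, 30, 6, 15, 5, 33]
for i, r in enumerate(em_abcd_aacd_ril(observed)):
    print(i, round(r, 4))
```

The printed sequence can be compared with the bottom row of table 8.8. Calling `em_abcd_aacd_ril(observed, r0=0.9, tol=1e-10)` is one way to try a different starting value.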


8.1.4 Situations When the Number of Inbred Parents Is Smaller
Than Four


When parent A is the same as parent B, or parent C is the same as parent D, the
double cross is denoted as (A × A) × (C × D) or (A × B) × (C × C), which is
equivalent to a three-way cross F1, i.e., A × (C × D) or (A × B) × C. When parent
A is the same as parent C, or parent A is the same as parent D, only three parents
are involved in the double cross, which is denoted as (A × B) × (A × D) or
(A × B) × (C × A). When parent A is the same as parent C, and parent B is the
same as parent D, only two parents are involved in the double cross, which is
denoted as (A × B) × (A × B). The above situations can be treated as special cases
of the double cross F1, and the pure-line progeny populations developed under such
situations can also be analyzed by the same methods as introduced previously.
For example, in the three-way cross A × (C × D), all markers belong to Category
AACD; in double cross (A × B) × (A × D) with three parents, all markers belong
to Category ABAD; in double cross (A × B) × (A × B) with two parents, all
markers belong to Category ABAB.

It can be seen that the double cross F1 from only two parents, i.e., (A × B) ×
(A × B), is in fact the hybrid F2 population of the two parents. So, if wanted, the
bi-parental RIL population can also be viewed as a special case of the four-parental
RIL populations in genetic analysis. If the pure-line population is developed by
doubling the gametes of the bi-parental hybrid F1, the bi-parental DH population
cannot be viewed as a special case of the four-parental DH population in genetic
analysis. However, if the DH lines are developed by doubling the gametes of the
bi-parental hybrid F2, the bi-parental DH population can be viewed as a special case
of the four-parental DH population. Hopefully, the readers can understand this
point more clearly by working out exercises 8.5 and 8.6.


8.2 Linkage Analysis in Eight-Parental Pure-Line
Populations


8.2.1 Development Procedure and Marker Classification 
in Eight-Parental Pure-Line Populations 


Figure 8.3 shows the development procedure of DH and RIL populations from eight
inbred parents. First, four single crosses are made from the eight inbred parents.
Second, two four-way crosses are generated from the four single crosses. Third,
the eight-way cross is made between the two four-way crosses, and its progenies are
highly heterogeneous and heterozygous. Finally, the DH lines are produced by the
pollen culture technology, or the RILs are produced by repeated selfing (figure 8.3).
At one genetic locus, the eight parental alleles are denoted by A-H. The pure-line
progenies have eight homozygous genotypes denoted by AA-HH. No heterozygous
genotypes are present. When no distortion occurs, the eight homozygous genotypes
each account for one-eighth of the progeny population. When considering two loci
together, the pure-line progenies have a total of 64 homozygous genotypes. When
the two loci are not linked, the 64 genotypes have an equal theoretical frequency.
When considering linkage, the theoretical frequencies of the 64 genotypes depend on
the recombination frequency between the two loci.

In figure 8.3, each single cross F1 has only one heterozygous genotype. Therefore,
as long as enough hybrid seeds can be produced for the next generation, the double
cross ABCD can be generated by hybridization between one individual in single
cross AB and one individual in single cross CD. Similarly, the double cross EFGH
can be generated by hybridization between one individual in single cross EF and
one individual in single cross GH. A double cross F1 population is heterogeneous and
heterozygous (figure 8.3), i.e., different individuals in the population have different
heterozygous genotypes. To maximize the effective size of the eight-way cross F1
population, the ideal situation is to use as many individuals in the two double crosses
ABCD and EFGH as possible, and to harvest only one seed from each
hybridization and advance it to the next generation. In other words, each
individual in the eight-way cross F1 population can be traced to one individual in
double cross ABCD and one individual in double cross EFGH, which are used as the two
parents to derive the individual. In developing pure lines by repeated selfing, the
same is true when using single-seed descent. Each pure line thus generated can
be traced back to one single individual in the eight-way cross F1 population, by
which the maximum effective population size is maintained (Wang, 2017).

FIG. 8.3 — Schematic representation of the development of DH and RIL populations from
eight inbred parents.

Though the multi-parental populations have some advantages in genetic studies,
the readers should also be able to see from this chapter that more parents bring more
alleles and more genotypes at each locus in the progeny population, which will
complicate the genetic analysis methods. In addition, two loci have to be considered
at the same time in linkage analysis, and three loci have to be considered at the same
time in QTL mapping. If the number of homozygous genotypes at one single locus is
eight, there are 8^2 = 64 homozygous genotypes at two loci, and 8^3 = 512 homozygous
genotypes at three loci. Heterozygous genotypes are not included in the pure-line
populations that this chapter focuses on. Sometimes, the heterozygous genotypes
have to be present in the population, so that the dominance-related genetic effects
can be investigated. In this situation, the number of genotypes to be considered is
much larger. Obviously, the genetic analysis methods become more and more
complicated as the number of parents increases.

It can be seen from figure 8.3 that during the development of eight-parental
pure-line populations, two four-parental pure-line populations can be generated by
haploid doubling or repeated selfing from the two double cross F1 populations
(analysis methods can be found in §8.1); four bi-parental pure-line populations can
be generated by haploid doubling or repeated selfing from the four single cross F1
populations (analysis methods can be found in chapters 2-6). Thus, different types
of pure-line populations, such as bi-parental, four-parental, and eight-parental, can
be produced and used together in systematic genetic research, which also provides
the opportunity to cross-check the genetic analysis results from different
types of populations.


8.2.2 Marker Classification and Genotypic Coding 
in Eight-Parental Pure-Line Populations 


In an eight-parental pure-line population, if there are eight identifiable alleles or 
marker types at one locus, alleles in the progenies can be traced to their parental 
origins. Such a locus is called fully informative (or complete locus) and is denoted by 
ABCDEFGH. If the number of alleles is smaller than eight, some genotypes cannot 
be separated, and therefore cannot be traced back completely to parental origins. 
The number of identifiable genotypes is also smaller than eight. Based on the
number of identifiable alleles in parents and the homozygous progenies, a total of
4139 categories of polymorphic markers can be differentiated, i.e., the 4140 ways of
grouping eight parents by shared alleles, less the single non-polymorphic case in
which all parents carry the same allele. These are too many to list here in detail.
The integrated software GAPL (Zhang et al., 2019) is taken as an example to
illustrate how such a large number of marker categories can be recognized.

In the GAPL software, each marker category is represented by a string with eight
letters chosen from A-H. One letter in a string can appear once or more than once.
For example, string ABCDEFGH represents the fully-informative markers, where
the alleles of all parents are fully identifiable with each other.
String AACDEFAC represents the case where the alleles from parents A, B, and G
are the same, and meanwhile, the alleles from parents C and H are the same. To use
the eight-lettered strings to properly represent all categories of polymorphic markers
in the eight-parental pure-line population, some rules have to be followed, which are
briefly described below.

The first position in the eight-lettered string can only be letter A; the second
position can be either letter B or letter A; the third position can be letter C or any
letter that has appeared in the previous two positions; the fourth position can be
letter D or any letter that has appeared in the previous three positions, and so on
for the fifth to eighth positions. All possible categories of polymorphic markers in the
eight-parental pure-line populations can be properly defined and differentiated by
the naming rules described above. For example, string AACDEFGH indicates that
parents A and B carry the same allele, and alleles from the other six parents are fully
identifiable; string AAADEFFH indicates that parents A, B, and C carry the same
allele, parents F and G carry the same allele, and alleles from the other three parents
(D, E, and H) are fully identifiable with each other; string ABCCCCCH indicates that
parents C-G carry the same allele, and alleles of the other parents are fully
identifiable with each other; and so on. The strings mentioned above meet the naming
rules and therefore are effective in representing the specific marker categories. As a
counterexample, string AACDEFBH is not valid for any marker category, since the
letter at the seventh position, i.e., B, is neither G nor a letter that appears in the
first to sixth positions. This string will be treated as invalid by the GAPL software.
The users have to modify it to AACDEFAH, AACDEFCH, or any other valid string,
based on the polymorphism identified at the locus in the eight parents and the
progeny population.
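
The naming rules can be checked mechanically. The following is a minimal Python sketch written for illustration only; the function names `is_valid_category` and `enumerate_categories` are hypothetical and are not part of the GAPL software. It tests a string against the rules and enumerates every string that satisfies them.

```python
def is_valid_category(name, letters="ABCDEFGH"):
    """Return True if `name` follows the naming rules described above.

    Position i may hold its own letter (A for position 1, B for position 2, ...)
    or any letter that has already appeared at an earlier position.
    """
    if len(name) != len(letters):
        return False
    seen = set()
    for pos, ch in enumerate(name):
        if ch != letters[pos] and ch not in seen:
            return False
        seen.add(ch)
    return True


def enumerate_categories(letters="ABCDEFGH"):
    """List every string allowed by the naming rules (hypothetical helper)."""
    names = [""]
    for own in letters:
        # Extend each prefix with its own letter or any letter already used.
        names = [s + ch for s in names for ch in sorted(set(s) | {own})]
    return names


if __name__ == "__main__":
    for name in ("ABCDEFGH", "AACDEFAC", "ABCCCCCH", "AACDEFBH"):
        print(name, is_valid_category(name))
    # The all-A string is valid by the rules but is not polymorphic,
    # so it is excluded when counting polymorphic marker categories.
    print(len(enumerate_categories()) - 1)
    print(len(enumerate_categories("ABCD")) - 1)
```

Excluding the non-polymorphic string made of the letter A only, the enumeration yields 4139 categories for eight parents and 14 categories for four parents, in agreement with the numbers quoted in the text.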

When the genetic analysis is conducted in an eight-parental pure-line population
by the GAPL software, the users have to indicate the category of each marker by the
naming rules given above. Genotypes of progenies are coded by letters A to H, and
the missing genotypes are coded by X. For each marker category, A, B, C, D, E, F,
G, H, and X are the only valid values of genotypes, and any other values are treated
as invalid by the software. For example, for a marker belonging to category
ABCCCCCH, the progeny genotypes may be coded by D, E, F, and G as well as by C.
From the specified category, the software recognizes that this marker has only
four identifiable genotypes and automatically combines the genotypes coded by C, D,
E, F, and G into one identifiable homozygous genotype in linkage analysis. The
authors believe that the marker classification and the flexible genotypic coding as
implemented in the GAPL software are easy for readers and users to understand and
to adopt.
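
The combination of genotype codes can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the GAPL implementation: the helper names `identifiable_genotype` and `count_identifiable` are hypothetical, and each identifiable genotype is labelled by the letter that its parent carries in the category string.

```python
PARENTS = "ABCDEFGH"


def identifiable_genotype(category, code):
    """Map a progeny code (A-H, or X for missing) to its identifiable genotype.

    The label is the letter found in the category string at the position of
    the coded parent; None is returned for a missing value.
    """
    if code == "X":
        return None
    if code not in PARENTS:
        raise ValueError(f"invalid genotype code: {code!r}")
    return category[PARENTS.index(code)]


def count_identifiable(category, codes):
    """Count progenies per identifiable genotype, plus the missing values."""
    counts = {label: 0 for label in sorted(set(category))}
    missing = 0
    for code in codes:
        label = identifiable_genotype(category, code)
        if label is None:
            missing += 1
        else:
            counts[label] += 1
    return counts, missing


if __name__ == "__main__":
    # Codes C, D, E, F, and G all fall into one identifiable genotype.
    print(count_identifiable("ABCCCCCH", "ACDEFGHBX"))
```

The number of keys in the returned dictionary equals the number of different letters in the category name, i.e., the number of identifiable homozygous genotypes.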

As a matter of fact, similar naming rules have been followed in the representation
of the 14 marker categories in the four-parental pure-line populations. As given in
table 8.1, each string has four letters chosen from A-D. The readers can check that
for all category names listed in table 8.1, the first position is always letter A; the
second position is either B or A; the third position is either C or a letter that
appears in the first or second position; the fourth position is either D or a letter
that appears in the first to third positions. In addition, the number of different
letters included in each four-lettered name of the category is equal to the number of
identifiable homozygous genotypes in four-parental pure-line populations; the number
of different letters included in each eight-lettered name of the category is equal to
the number of identifiable homozygous genotypes in the eight-parental pure-line
population.


8.2.3 Theoretical Frequencies of Genotypes 
at Two Complete Loci 


Assuming there are eight identifiable alleles at each locus, the eight-parental
pure-line population has eight identifiable homozygous genotypes at one complete
locus. When two complete loci are considered together, there are a total of 64
homozygous genotypes in the progeny population, and their frequencies depend on
the recombination frequency between the two loci. To obtain the theoretical
frequencies of homozygous genotypes in the eight-parental pure-line population, the
theoretical frequencies of genotypes in the eight-way cross $F_1$ population are
needed first. The ABCD double cross $F_1$ will generate 16 haploid types of gametes.
By haploid doubling, one four-parental DH population is produced, which has been
introduced in §8.1. Therefore, the frequencies given in table 8.4 are also the
theoretical frequencies of the 16 types of gametes generated by the ABCD double
cross $F_1$, which are re-organized in table 8.9.

The EFGH double cross $F_1$ will also generate 16 haploid types of gametes, having
the same frequencies as the gametes generated by the ABCD double cross $F_1$, but
carrying different alleles (table 8.9). Hybridization between the two double-cross $F_1$
populations is equivalent to the random union between the 16 types of gametes from
double-cross ABCD and the 16 types of gametes from double-cross EFGH, resulting
in a total of 16 × 16 = 256 progeny genotypes. The theoretical frequencies of the
256 genotypes in the eight-way cross population are readily calculated from the
theoretical frequencies of gametes given in table 8.9, and are not listed here in
detail.


TAB. 8.9 — Theoretical frequencies of the 16 haploid types of gametes at two complete loci
generated by two double crosses ABCD and EFGH.


Gamete generated by the    Theoretical               Gamete generated by the    Theoretical
ABCD double cross $F_1$    frequency                 EFGH double cross $F_1$    frequency

$A_1A_2$                   $\frac{1}{4}(1-r)^2$      $E_1E_2$                   $\frac{1}{4}(1-r)^2$
$A_1B_2$                   $\frac{1}{4}r(1-r)$       $E_1F_2$                   $\frac{1}{4}r(1-r)$
$A_1C_2$                   $\frac{1}{8}r$            $E_1G_2$                   $\frac{1}{8}r$
$A_1D_2$                   $\frac{1}{8}r$            $E_1H_2$                   $\frac{1}{8}r$
$B_1A_2$                   $\frac{1}{4}r(1-r)$       $F_1E_2$                   $\frac{1}{4}r(1-r)$
$B_1B_2$                   $\frac{1}{4}(1-r)^2$      $F_1F_2$                   $\frac{1}{4}(1-r)^2$
$B_1C_2$                   $\frac{1}{8}r$            $F_1G_2$                   $\frac{1}{8}r$
$B_1D_2$                   $\frac{1}{8}r$            $F_1H_2$                   $\frac{1}{8}r$
$C_1A_2$                   $\frac{1}{8}r$            $G_1E_2$                   $\frac{1}{8}r$
$C_1B_2$                   $\frac{1}{8}r$            $G_1F_2$                   $\frac{1}{8}r$
$C_1C_2$                   $\frac{1}{4}(1-r)^2$      $G_1G_2$                   $\frac{1}{4}(1-r)^2$
$C_1D_2$                   $\frac{1}{4}r(1-r)$       $G_1H_2$                   $\frac{1}{4}r(1-r)$
$D_1A_2$                   $\frac{1}{8}r$            $H_1E_2$                   $\frac{1}{8}r$
$D_1B_2$                   $\frac{1}{8}r$            $H_1F_2$                   $\frac{1}{8}r$
$D_1C_2$                   $\frac{1}{4}r(1-r)$       $H_1G_2$                   $\frac{1}{4}r(1-r)$
$D_1D_2$                   $\frac{1}{4}(1-r)^2$      $H_1H_2$                   $\frac{1}{4}(1-r)^2$

A generation transition matrix similar to the one given in table 8.3 is used to
calculate the genotypic frequencies in the eight-parental pure-line population. The
transition matrix has 256 rows, corresponding to the 256 genotypes in the
eight-way cross $F_1$ population, and 64 columns, corresponding to the 64
homozygous genotypes in the eight-parental pure-line populations. Similar to
table 8.3, each row of the transition matrix from the eight-way cross $F_1$ to DH
lines has two non-zero elements of $\frac{1}{2}(1-r)$ and two non-zero elements of
$\frac{1}{2}r$, and the remaining elements are equal to 0. Each row of the transition
matrix from the eight-way cross $F_1$ to RILs has two non-zero elements of
$\frac{1}{2}(1-R)$ and two non-zero elements of $\frac{1}{2}R$, and the remaining
elements are all equal to 0, where $R = \frac{2r}{1+2r}$ as given in equation 8.1.

Take genotype $A_1A_2/E_1E_2$ as an example (table 8.9), which is generated by the
union of gametes $A_1A_2$ and $E_1E_2$. Individuals having this genotype will produce
four haploid types of gametes during the process of meiosis. Among them, $A_1A_2$ and
$E_1E_2$ are the two parental types each with the frequency of $\frac{1}{2}(1-r)$, and
$A_1E_2$ and $E_1A_2$ are the two recombinant types each with the frequency of
$\frac{1}{2}r$. So in the transition matrix from the eight-way cross $F_1$ to DH lines,
elements corresponding to homozygous genotypes $A_1A_1A_2A_2$ and $E_1E_1E_2E_2$ are
both equal to $\frac{1}{2}(1-r)$; elements corresponding to homozygous genotypes
$A_1A_1E_2E_2$ and $E_1E_1A_2A_2$ are both equal to $\frac{1}{2}r$; elements
corresponding to the remaining genotypes are all equal to 0. Using the
genotypic frequencies in the eight-way cross $F_1$ population and the generation
transition matrix to DH lines, theoretical frequencies of the 64 homozygous
genotypes in the eight-parental DH population can be calculated, which are given in
table 8.10.
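
The calculation described above can be reproduced numerically. The Python sketch below is an illustration under stated assumptions rather than the authors' program: the function names are hypothetical, and the transition matrix is applied implicitly by looping over the 256 genotypes of the eight-way cross instead of being stored. The argument `rec` is the probability that a derived pure line is recombinant, i.e., r for DH lines and R = 2r/(1 + 2r) for RILs.

```python
from itertools import product


def gamete_freqs(parents, r):
    """Two-locus gamete frequencies of one double cross (table 8.9).

    `parents` is a four-letter string such as "ABCD"; the first two letters
    form one single cross and the last two letters form the other.
    """
    freqs = {}
    for i, j in product(range(4), repeat=2):
        if i == j:
            f = (1 - r) ** 2 / 4
        elif i // 2 == j // 2:      # two parents of the same single cross
            f = r * (1 - r) / 4
        else:                       # parents from different single crosses
            f = r / 8
        freqs[parents[i] + parents[j]] = f
    return freqs


def pure_line_freqs(r, rec):
    """Frequencies of the 64 homozygous genotypes in the pure-line progenies."""
    left = gamete_freqs("ABCD", r)
    right = gamete_freqs("EFGH", r)
    table = {a + b: 0.0 for a, b in product("ABCDEFGH", repeat=2)}
    for (g1, f1), (g2, f2) in product(left.items(), right.items()):
        f = f1 * f2                                  # eight-way F1 genotype g1/g2
        table[g1] += f * (1 - rec) / 2               # parental types
        table[g2] += f * (1 - rec) / 2
        table[g1[0] + g2[1]] += f * rec / 2          # recombinant types
        table[g2[0] + g1[1]] += f * rec / 2
    return table


if __name__ == "__main__":
    r = 0.1
    dh = pure_line_freqs(r, r)
    ril = pure_line_freqs(r, 2 * r / (1 + 2 * r))
    print(dh["AA"], dh["AB"], dh["AC"], dh["AE"])
    print(ril["AA"], ril["AB"], ril["AC"], ril["AE"])
```

Each key of the returned dictionary gives the parental origin of the alleles at locus 1 and locus 2, so that `dh["AE"]` is the frequency of the homozygous genotype carrying the allele of parent A at locus 1 and the allele of parent E at locus 2.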

Take again genotype $A_1A_2/E_1E_2$ as an example (table 8.9), which is generated by
the union of gametes $A_1A_2$ and $E_1E_2$. Individuals having this genotype will
produce four homozygous genotypes during repeated selfing. Among them, genotypes
$A_1A_1A_2A_2$ and $E_1E_1E_2E_2$ are the two parental types each with the frequency
of $\frac{1}{2}(1-R)$, and $A_1A_1E_2E_2$ and $E_1E_1A_2A_2$ are the two recombinant
types each with the frequency of $\frac{1}{2}R$. Therefore, in the transition matrix
from the eight-way cross $F_1$ to the RIL population, elements corresponding to
homozygous genotypes $A_1A_1A_2A_2$ and $E_1E_1E_2E_2$ are both equal to
$\frac{1}{2}(1-R)$; elements corresponding to genotypes $A_1A_1E_2E_2$ and
$E_1E_1A_2A_2$ are both equal to $\frac{1}{2}R$; elements corresponding to the
remaining genotypes are all equal to 0, where $R = \frac{2r}{1+2r}$ as given in
equation 8.1. Using the genotypic frequencies in the eight-way cross $F_1$ population
and the generation transition matrix to RILs, theoretical frequencies of the 64
homozygous genotypes in the eight-parental RIL population can be calculated, which
are given in table 8.11.


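TAB. 8.10 — Theoretical frequencies of the 64 homozygous genotypes at two linked complete loci in the eight-parental DH population.


Genotype     Genotype at locus 2
at locus 1   $A_2A_2$               $B_2B_2$               $C_2C_2$               $D_2D_2$               $E_2E_2$               $F_2F_2$               $G_2G_2$               $H_2H_2$

$A_1A_1$     $\frac{1}{8}(1-r)^3$   $\frac{1}{8}r(1-r)^2$  $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$   $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$
$B_1B_1$     $\frac{1}{8}r(1-r)^2$  $\frac{1}{8}(1-r)^3$   $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$   $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$
$C_1C_1$     $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$   $\frac{1}{8}(1-r)^3$   $\frac{1}{8}r(1-r)^2$  $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$
$D_1D_1$     $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$   $\frac{1}{8}r(1-r)^2$  $\frac{1}{8}(1-r)^3$   $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$
$E_1E_1$     $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{8}(1-r)^3$   $\frac{1}{8}r(1-r)^2$  $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$
$F_1F_1$     $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{8}r(1-r)^2$  $\frac{1}{8}(1-r)^3$   $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$
$G_1G_1$     $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$   $\frac{1}{8}(1-r)^3$   $\frac{1}{8}r(1-r)^2$
$H_1H_1$     $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{32}r$        $\frac{1}{16}r(1-r)$   $\frac{1}{16}r(1-r)$   $\frac{1}{8}r(1-r)^2$  $\frac{1}{8}(1-r)^3$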



TAB. 8.11 — Theoretical frequencies of the 64 homozygous genotypes at two linked complete loci in the eight-parental RIL population.


Genotype     Genotype at locus 2
at locus 1   $A_2A_2$                     $B_2B_2$                     $C_2C_2$                     $D_2D_2$                     $E_2E_2$                     $F_2F_2$                     $G_2G_2$                     $H_2H_2$

$A_1A_1$     $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r(1-r)}{8(1+2r)}$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$
$B_1B_1$     $\frac{r(1-r)}{8(1+2r)}$     $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$
$C_1C_1$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r(1-r)}{8(1+2r)}$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$
$D_1D_1$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r(1-r)}{8(1+2r)}$     $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$
$E_1E_1$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r(1-r)}{8(1+2r)}$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$
$F_1F_1$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r(1-r)}{8(1+2r)}$     $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$
$G_1G_1$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{(1-r)^2}{8(1+2r)}$    $\frac{r(1-r)}{8(1+2r)}$
$H_1H_1$     $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r}{16(1+2r)}$         $\frac{r(1-r)}{8(1+2r)}$     $\frac{(1-r)^2}{8(1+2r)}$


8.2.4 Estimation of the Recombination Frequency Between
Any Two Categories of Markers 


Assume that $n$ is the size of the eight-parental pure-line population, and $n_1$ to
$n_{64}$ are the observed sample sizes of the 64 homozygous genotypes as given in
table 8.10 (for DHs) and table 8.11 (for RILs). The 64 homozygous genotypes are
numbered row by row, i.e., genotypes 1-8 occupy the first row, 9-16 the second row,
and so on. The theoretical frequencies in tables 8.10 and 8.11 lead to the maximum
likelihood estimate of recombination frequency between two complete loci in the DH
and RIL populations, i.e., equations 8.9 and 8.10. It can be seen that equation 8.10
is a quadratic equation in the recombination frequency to be estimated. One root of
the equation is between 0 and 0.5 and is taken as the estimate of recombination
frequency.


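$$\hat{r} = \frac{s_1}{s_1 + s_2} \quad \text{(eight-parental DH population)} \tag{8.9}$$

$$2(n - s_1 - s_3)\hat{r}^2 - (2n - s_1 + s_3)\hat{r} + s_1 = 0 \quad \text{(eight-parental RIL population)} \tag{8.10}$$

where

$$s_1 = n - (n_1 + n_{10} + n_{19} + n_{28} + n_{37} + n_{46} + n_{55} + n_{64}),$$

$$\begin{aligned}
s_2 = {} & 3(n_1 + n_{10} + n_{19} + n_{28} + n_{37} + n_{46} + n_{55} + n_{64}) \\
& + 2(n_2 + n_9 + n_{20} + n_{27} + n_{38} + n_{45} + n_{56} + n_{63}) \\
& + n_3 + n_4 + n_{11} + n_{12} + n_{17} + n_{18} + n_{25} + n_{26} \\
& + n_{39} + n_{40} + n_{47} + n_{48} + n_{53} + n_{54} + n_{61} + n_{62},
\end{aligned}$$

$$\begin{aligned}
s_3 = {} & 2(n_1 + n_{10} + n_{19} + n_{28} + n_{37} + n_{46} + n_{55} + n_{64}) \\
& + (n_2 + n_9 + n_{20} + n_{27} + n_{38} + n_{45} + n_{56} + n_{63}).
\end{aligned}$$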

Table 8.12 gives the observed sample sizes of 64 homozygous genotypes between
two complete loci in an eight-parental pure-line population of size 200. If this
population consists of DH lines, the recombination frequency is estimated at 0.0730
from equation 8.9. If this population consists of RILs, the two roots of equation 8.10
are equal to 0.0566 and -2.1042, and 0.0566 is the maximum likelihood estimate of
recombination frequency.


TAB. 8.12 — Observed sample sizes of the 64 homozygous genotypes at two complete loci in an
eight-parental pure-line population.


Genotype     Genotype at locus 2
at locus 1   $A_2A_2$  $B_2B_2$  $C_2C_2$  $D_2D_2$  $E_2E_2$  $F_2F_2$  $G_2G_2$  $H_2H_2$

$A_1A_1$     17        3         2         0         2         0         0         0
$B_1B_1$     0         15        1         2         1         2         0         1
$C_1C_1$     2         1         18        1         1         4         0         0
$D_1D_1$     0         3         0         28        0         0         0         0
$E_1E_1$     0         2         1         1         26        1         0         0
$F_1F_1$     0         0         1         2         0         18        0         1
$G_1G_1$     0         0         1         0         0         0         16        3
$H_1H_1$     1         0         0         0         0         0         0         22
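
The two estimates can be reproduced with a few lines of Python. This is a minimal sketch for illustration; the function names are hypothetical, and the quadratic is solved with the standard formula under the assumption that its leading coefficient is not zero.

```python
import math

# Observed sample sizes from table 8.12 (rows: locus 1, columns: locus 2).
TABLE_8_12 = [
    [17, 3, 2, 0, 2, 0, 0, 0],
    [0, 15, 1, 2, 1, 2, 0, 1],
    [2, 1, 18, 1, 1, 4, 0, 0],
    [0, 3, 0, 28, 0, 0, 0, 0],
    [0, 2, 1, 1, 26, 1, 0, 0],
    [0, 0, 1, 2, 0, 18, 0, 1],
    [0, 0, 1, 0, 0, 0, 16, 3],
    [1, 0, 0, 0, 0, 0, 0, 22],
]


def summary_counts(counts):
    """Return n, s1, s2, and s3 as defined below equations 8.9 and 8.10."""
    n = sum(map(sum, counts))
    diag = pair = quad = 0
    for i in range(8):
        for j in range(8):
            if i == j:
                diag += counts[i][j]        # n1, n10, ..., n64
            elif i // 2 == j // 2:
                pair += counts[i][j]        # n2, n9, n20, n27, ...
            elif i // 4 == j // 4:
                quad += counts[i][j]        # n3, n4, n11, n12, ...
    return n, n - diag, 3 * diag + 2 * pair + quad, 2 * diag + pair


def estimate_dh(counts):
    """Equation 8.9."""
    _, s1, s2, _ = summary_counts(counts)
    return s1 / (s1 + s2)


def roots_ril(counts):
    """Both roots of the quadratic equation 8.10 (assumes a != 0)."""
    n, s1, _, s3 = summary_counts(counts)
    a, b, c = 2 * (n - s1 - s3), -(2 * n - s1 + s3), s1
    disc = math.sqrt(b * b - 4 * a * c)
    return sorted([(-b - disc) / (2 * a), (-b + disc) / (2 * a)])


def estimate_ril(counts):
    """The root of equation 8.10 lying between 0 and 0.5."""
    return next(x for x in roots_ril(counts) if 0 <= x <= 0.5)


if __name__ == "__main__":
    print(round(estimate_dh(TABLE_8_12), 4))
    print([round(x, 4) for x in roots_ril(TABLE_8_12)])
```

For the data in table 8.12, $s_1 = 40$, $s_2 = 508$, and $s_3 = 328$, which gives the DH estimate 40/548 and the quadratic $42r^2 + 86r - 5 = 0$ for RILs.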

Equations 8.9 and 8.10 provide the estimation of recombination frequency for two 
complete markers in the eight-parental DH and RIL populations, respectively. 
The EM algorithm has to be adopted when incomplete markers are involved. Take 
two markers belonging to categories ABBBEFBA (locus 1) and AACDEEEH (locus 
2) as an example to briefly illustrate the use of the EM algorithm in recombination 
frequency estimation in eight-parental pure-line populations. At locus 1, parents A 
and H carry the same allele, parents B, C, D and G carry the same allele, and parents 
E and F carry different alleles. There are four different letters in the category name 
ABBBEFBA. Therefore, locus 1 has four identifiable alleles, and four identifiable 
homozygous genotypes. Locus 2 belongs to category AACDEEEH with five different 
letters and therefore has five identifiable alleles and five identifiable homozygous 
genotypes. Considering the two markers together, the progeny population has a total 
of 20 identifiable homozygous genotypes. Table 8.13 shows the relationship between 
the 20 identifiable homozygous genotypes and the 64 complete homozygous geno- 
types. Given an initial value of the recombination frequency to the EM algorithm, 
sample sizes of the 64 complete genotypes can be estimated from the observed sample 
sizes of 20 incomplete genotypes, by using the information indicated in table 8.13. 
Then equations 8.9 and 8.10 are applied to re-calculate the recombination frequency. 


TAB. 8.13 — Relationship between the 20 identifiable homozygous genotypes (1-20) at two
incomplete markers and the 64 homozygous genotypes (1-64) at two complete markers.


Allele at    Genotypes at two complete markers,       Genotypes at two incomplete markers
locus 1      i.e., category ABCDEFGH                  of categories ABBBEFBA and AACDEEEH
             (allele at locus 2)                      (allele at locus 2)
             A   B   C   D   E   F   G   H            A   B   C   D   E   F   G   H

A            1   2   3   4   5   6   7   8            1   1   2   3   4   4   4   5
B            9   10  11  12  13  14  15  16           6   6   7   8   9   9   9   10
C            17  18  19  20  21  22  23  24           6   6   7   8   9   9   9   10
D            25  26  27  28  29  30  31  32           6   6   7   8   9   9   9   10
E            33  34  35  36  37  38  39  40           11  11  12  13  14  14  14  15
F            41  42  43  44  45  46  47  48           16  16  17  18  19  19  19  20
G            49  50  51  52  53  54  55  56           6   6   7   8   9   9   9   10
H            57  58  59  60  61  62  63  64           1   1   2   3   4   4   4   5


For the two markers belonging to categories ABBBEFBA and AACDEEEH, the
EM algorithm in recombination frequency estimation is described as follows. Firstly,
an initial value of recombination frequency is assigned and then used to calculate the
frequencies of the 64 complete genotypes in table 8.10 or 8.11. Next, the complete
genotypes which are included in each of the 20 identifiable genotypes are identified
from table 8.13, and the ratio of frequencies of the complete genotypes is calculated.
The observed sample size of each identifiable genotype is split among the included
complete genotypes by the calculated ratio. Let $k$ and $p_k$ denote a complete
genotype included in the $i$th identifiable genotype and the theoretical frequency of
that complete genotype, and let $\sum p_k$ denote the sum of frequencies of all
complete genotypes included in the $i$th identifiable genotype, which is also the
theoretical frequency of the $i$th identifiable genotype. Let $m_i$ denote the observed
sample size of the $i$th identifiable genotype (i.e., right part in table 8.13), and
$n_k$ denote the expected sample size of the $k$th complete genotype (i.e., left part
in table 8.13). The observed sample size $m_i$ is then split to the expected sample
size $n_k$ of the complete genotype by equation 8.11.


$$n_k = \frac{p_k}{\sum p_k}\, m_i \tag{8.11}$$


For example, in table 8.13, identifiable genotype 1 contains the complete
genotypes 1, 2, 57, and 58, i.e., $k$ = 1, 2, 57, 58, and $i$ = 1 in equation 8.11. From
table 8.10 or 8.11, frequencies of the four complete genotypes can be calculated, i.e.,
$p_k$ in equation 8.11, given an initial value of recombination frequency. Then the
observed sample size of the identifiable genotype, i.e., $m_1$, is split to the sample
sizes of the included complete genotypes, i.e., $n_1$, $n_2$, $n_{57}$, and $n_{58}$, by
equation 8.11. When sample sizes of the 64 complete genotypes have been assigned,
the recombination frequency is updated by equation 8.9 or 8.10, and then used as a
new initial value for the next iteration.
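
The iteration described above can be sketched in Python for the DH case. The code is an illustrative sketch under stated assumptions and is not the GAPL implementation: all function names are hypothetical, identifiable genotypes are labelled by the pair of category letters, the stopping rule (a change smaller than `tol`) is an assumption, and the counts of table 8.12 are collapsed only to obtain example data for two incomplete markers.

```python
COUNTS = [  # complete-marker counts of table 8.12, used here as example data
    [17, 3, 2, 0, 2, 0, 0, 0], [0, 15, 1, 2, 1, 2, 0, 1],
    [2, 1, 18, 1, 1, 4, 0, 0], [0, 3, 0, 28, 0, 0, 0, 0],
    [0, 2, 1, 1, 26, 1, 0, 0], [0, 0, 1, 2, 0, 18, 0, 1],
    [0, 0, 1, 0, 0, 0, 16, 3], [1, 0, 0, 0, 0, 0, 0, 22],
]


def dh_freq(i, j, r):
    """Frequency in table 8.10 for parent i at locus 1 and parent j at locus 2."""
    if i == j:
        return (1 - r) ** 3 / 8
    if i // 2 == j // 2:
        return r * (1 - r) ** 2 / 8
    if i // 4 == j // 4:
        return r * (1 - r) / 16
    return r / 32


def update_r(n):
    """Equation 8.9 applied to an 8 x 8 table of expected complete counts."""
    s1 = s2 = 0.0
    for i in range(8):
        for j in range(8):
            if i == j:
                s2 += 3 * n[i][j]
            else:
                s1 += n[i][j]
                if i // 2 == j // 2:
                    s2 += 2 * n[i][j]
                elif i // 4 == j // 4:
                    s2 += n[i][j]
    return s1 / (s1 + s2)


def collapse(counts, cat1, cat2):
    """Merge complete counts into identifiable genotypes (as in table 8.13)."""
    observed = {}
    for i in range(8):
        for j in range(8):
            key = (cat1[i], cat2[j])
            observed[key] = observed.get(key, 0) + counts[i][j]
    return observed


def em_dh(observed, cat1, cat2, r=0.25, tol=1e-8, max_iter=500):
    """EM estimate of r in a DH population for two markers of given categories."""
    for _ in range(max_iter):
        n = [[0.0] * 8 for _ in range(8)]
        for (g1, g2), m in observed.items():
            members = [(i, j) for i in range(8) for j in range(8)
                       if cat1[i] == g1 and cat2[j] == g2]
            total = sum(dh_freq(i, j, r) for i, j in members)
            for i, j in members:
                n[i][j] = m * dh_freq(i, j, r) / total   # equation 8.11
        new_r = update_r(n)                               # equation 8.9
        if abs(new_r - r) < tol:
            return new_r
        r = new_r
    return r


if __name__ == "__main__":
    obs = collapse(COUNTS, "ABBBEFBA", "AACDEEEH")
    print(len(obs), em_dh(obs, "ABBBEFBA", "AACDEEEH"))
```

When both markers belong to category ABCDEFGH, every identifiable genotype contains exactly one complete genotype, and the procedure reduces to a direct application of equation 8.9. For RILs, the same loop would apply with the frequencies of table 8.11 and the root of equation 8.10 in place of `update_r`.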


8.2.5 Situations When the Number of Inbred Parents
Is Smaller Than Eight


Similar to the four-parental pure-line populations as introduced in §8.1, the number of
inbred parents may be smaller than eight. For example, when parents G and H are the
same as parents A and B, respectively, there are only six parents, and all markers
belong to category ABCDEFAB. When parents E to H are the same, there are only five
parents, and all markers belong to category ABCDEEEE. If parents E to H are the same
as parents A to D, respectively, there are only four parents, and all markers belong to
category ABCDABCD. Therefore, many pure-line populations with fewer than eight
parents can also be analyzed by the methods introduced in this section.

Note that the pure-line population derived from the eight-way cross represented
by A/B//C/D///A/B//C/D is equivalent to the pure-line population
developed by haploid doubling or repeated selfing after one generation of random
mating in a double cross $F_1$ population. Therefore, strictly speaking, the
four-parental pure-line population as introduced in §8.1 cannot be viewed as a
special case of the eight-parental population introduced in this section.


8.3 QTL Mapping in Four-Parental Pure-Line
Populations


Linkage analysis, map construction, and QTL mapping can be completed in the
GAPL software for four kinds of multi-parental pure-line populations, i.e., the
four-parental DH and RIL populations, and eight-parental DH and RIL populations
(Zhang et al., 2019). The linkage analysis method has been introduced in §8.1 for the
pure-line populations derived from four inbred parents, and in §8.2 for the pure-line
populations derived from eight inbred parents. Based on the estimated pair-wise
recombination frequencies, the genetic linkage maps can be constructed in these
populations. The QTL mapping method in the four-parental DH and RIL populations
is introduced in this section. The readers can refer to Zhang et al. (2017) for more
details not covered in this section.


8.3.1 Genetic Constitution at Three Complete Loci 


To introduce the imputation algorithm on incomplete and missing marker types,
theoretical frequencies of the joint genotypes at three complete loci are introduced
first. To be consistent with the following sections on QTL mapping, one complete
locus is viewed as the QTL located between two complete markers. To obtain the
theoretical frequencies of 64 homozygous genotypes in the pure-line progenies, a
generation transition matrix similar to the one given in table 8.3 has to be defined.
However, when three complete loci are considered together, it is a 64 × 64 square
matrix which is not convenient to give here in detail. Considering the three
complete loci jointly, table 7.14 in chapter 7 provides the theoretical frequencies of
64 heterozygous genotypes in the double cross $F_1$ population. As an example, the
first genotype $A_1A_qA_2/C_1C_qC_2$ as given in table 7.14 is used here to explain the
elements in the transition matrix. The DH and RIL progenies derived from genotype
$A_1A_qA_2/C_1C_qC_2$ contain eight possible homozygous genotypes, and their
theoretical frequencies are given in table 8.14. Frequencies of the other 56
homozygous genotypes are equal to 0. Therefore, the vector corresponding to this
genotype in the transition matrix can be defined. It can be seen that the frequencies
given in table 8.14 for the joint genotypes at three loci are actually the same as
those in the DH and RIL populations derived from the single cross $F_1$ (see table 4.8
in chapter 4), where parents A and C are used to make the single cross $F_1$.

TAB. 8.14 — Homozygous genotypes and their theoretical frequencies in the four-parental DH
and RIL populations derived from genotype $A_1A_qA_2/C_1C_qC_2$ in the double cross $F_1$
population.


Genotype in the pure-line    Four-parental DH               Four-parental RIL
progenies                    population                     population

$A_1A_qA_2/A_1A_qA_2$        $\frac{1}{2}(1-r_1)(1-r_2)$    $\frac{1}{2}(1-R_1)(1-R_2)$
$C_1C_qC_2/C_1C_qC_2$        $\frac{1}{2}(1-r_1)(1-r_2)$    $\frac{1}{2}(1-R_1)(1-R_2)$
$A_1A_qC_2/A_1A_qC_2$        $\frac{1}{2}(1-r_1)r_2$        $\frac{1}{2}(1-R_1)R_2$
$C_1C_qA_2/C_1C_qA_2$        $\frac{1}{2}(1-r_1)r_2$        $\frac{1}{2}(1-R_1)R_2$
$A_1C_qC_2/A_1C_qC_2$        $\frac{1}{2}r_1(1-r_2)$        $\frac{1}{2}R_1(1-R_2)$
$C_1A_qA_2/C_1A_qA_2$        $\frac{1}{2}r_1(1-r_2)$        $\frac{1}{2}R_1(1-R_2)$
$A_1C_qA_2/A_1C_qA_2$        $\frac{1}{2}r_1r_2$            $\frac{1}{2}R_1R_2$
$C_1A_qC_2/C_1A_qC_2$        $\frac{1}{2}r_1r_2$            $\frac{1}{2}R_1R_2$


Note: the three complete loci are denoted as 1, q, and 2. Locus q is located between locus 1 and
locus 2. $r_1$ and $r_2$ are the recombination frequencies between locus 1 and locus q, and between
locus q and locus 2, respectively. $R_1$ and $R_2$ are the cumulated recombination frequencies during
the repeated selfing between locus 1 and locus q, and between locus q and locus 2, respectively.


Using the generation transition matrix from the double cross $F_1$ to DH lines at
three complete loci, theoretical frequencies of 64 homozygous genotypes in the
four-parental DH population can be obtained and are shown in table 8.15, where
$r_1$, $r_2$, and $r$ are the recombination frequencies between the left marker (i.e.,
locus 1) and locus q, between locus q and the right marker (i.e., locus 2), and
between locus 1 and locus 2, respectively. The three recombination frequencies have
the relationship $r = r_1 + r_2 - 2r_1r_2$ by assuming that the recombination events
in two neighboring intervals are independent. Marginal frequencies of the joint
genotypes at locus 1 and locus 2 are the same as those given in table 8.4 and
therefore are not shown in table 8.15.

Using the generation transition matrix from the double cross $F_1$ to RILs at three
complete loci, the theoretical frequencies of 64 homozygous genotypes in the
four-parental RIL population can be obtained and are shown in table 8.16, where
$r_1$, $r_2$, and $r$ are the recombination frequencies between loci 1 and q, between
loci q and 2, and between loci 1 and 2, respectively. The three recombination
frequencies have the relationship $r = r_1 + r_2 - 2r_1r_2$ by assuming that
recombination events in two neighboring intervals are independent. $R$ is the
cumulated recombination frequency during repeated selfing, i.e.,
$R_1 = \frac{2r_1}{1+2r_1}$ and $R_2 = \frac{2r_2}{1+2r_2}$. Theoretical genotypic
frequencies when locus 1 and locus 2 are considered jointly are the same as those
given in table 8.5 and therefore are not shown in table 8.16.
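
The pattern behind the three-locus frequencies can be summarized in a short function. The Python sketch below is a reconstruction for illustration, not code from the GAPL software: the function name `three_locus_freq` is hypothetical, and the case rules follow from combining the single-cross gametes with the transition frequencies of table 8.14, assuming independent recombination in the two intervals.

```python
from itertools import product


def three_locus_freq(x, y, z, r1, r2, population="DH"):
    """Frequency of the pure line carrying alleles x, y, z (letters A-D)
    at locus 1, locus q, and locus 2 in a four-parental population."""
    r = r1 + r2 - 2 * r1 * r2
    if population == "DH":
        c1, c2 = r1, r2                               # one meiosis in the double cross F1
    else:
        c1, c2 = 2 * r1 / (1 + 2 * r1), 2 * r2 / (1 + 2 * r2)   # R1 and R2 for RILs

    def same_or_not(u, v, rho):
        # Single-cross gamete: non-recombinant if u == v, recombinant otherwise.
        return 1 - rho if u == v else rho

    gx, gy, gz = ("ABCD".index(a) // 2 for a in (x, y, z))   # single cross of origin
    if gx == gy == gz:      # all three alleles from one single cross
        return (1 - c1) * (1 - c2) * same_or_not(x, y, r1) * same_or_not(y, z, r2) / 4
    if gx == gy:            # switch of single cross between locus q and locus 2
        return (1 - c1) * c2 * same_or_not(x, y, r1) / 8
    if gy == gz:            # switch of single cross between locus 1 and locus q
        return c1 * (1 - c2) * same_or_not(y, z, r2) / 8
    return c1 * c2 * same_or_not(x, z, r) / 8         # switches in both intervals


if __name__ == "__main__":
    r1, r2 = 0.1, 0.2
    for pop in ("DH", "RIL"):
        total = sum(three_locus_freq(x, y, z, r1, r2, pop)
                    for x, y, z in product("ABCD", repeat=3))
        print(pop, total)
```

For DH lines, summing the function over the four alleles at locus q returns the two-locus frequencies, e.g., $\frac{1}{4}(1-r)^2$ for alleles A and A, and $\frac{1}{8}r$ for alleles A and C at the two flanking loci.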


8.3.2 Imputation of the Incomplete and Missing Marker 
Information 


In the GAPL integrated software, valid values representing the progeny genotypes in
four-parental DH and RIL populations are A, B, C, D, and X, applicable to all
categories of markers as defined in table 8.1. Any other values are treated as invalid
when the population is loaded into the software for analysis (Zhang et al., 2019). In
counting the observed sample sizes of identifiable genotypes, the software will
automatically combine some values based on the specified marker categories. For
example, for one marker belonging to category AACD (i.e., category 2 in table 8.1),
the software will combine codes A and B, and make three identifiable genotypes at
the marker locus. For one marker belonging to category ABBB (i.e., category 11 in
table 8.1), the software will combine codes B, C, and D, and make two identifiable
genotypes at the marker locus. When considering two markers belonging to categories
AACD and ABBB jointly, the software will classify the pure-line progenies into
six identifiable genotypes, i.e., (1) A + B at locus 1, and A at locus 2; (2) A + B at
locus 1, and B + C + D at locus 2; (3) C at locus 1, and A at locus 2; (4) C at locus 1,
and B + C + D at locus 2; (5) D at locus 1, and A at locus 2; (6) D at locus 1, and
B + C + D at locus 2. Observed sample sizes of the six identifiable genotypes are
automatically counted by the software and then used in recombination frequency
estimation.


TAB. 8.15 — Theoretical frequencies of the 64 homozygous genotypes at three complete loci in the four-parental DH population.


Genotype     Genotype     Genotype at locus q (located between locus 1 and locus 2)
at locus 1   at locus 2   $A_qA_q$                             $B_qB_q$                             $C_qC_q$                             $D_qD_q$

$A_1A_1$     $A_2A_2$     $\frac{1}{4}(1-r_1)^2(1-r_2)^2$      $\frac{1}{4}r_1(1-r_1)r_2(1-r_2)$    $\frac{1}{8}r_1r_2(1-r)$             $\frac{1}{8}r_1r_2(1-r)$
$A_1A_1$     $B_2B_2$     $\frac{1}{4}(1-r_1)^2r_2(1-r_2)$     $\frac{1}{4}r_1(1-r_1)(1-r_2)^2$     $\frac{1}{8}r_1r_2r$                 $\frac{1}{8}r_1r_2r$
$A_1A_1$     $C_2C_2$     $\frac{1}{8}(1-r_1)^2r_2$            $\frac{1}{8}r_1(1-r_1)r_2$           $\frac{1}{8}r_1(1-r_2)^2$            $\frac{1}{8}r_1r_2(1-r_2)$
$A_1A_1$     $D_2D_2$     $\frac{1}{8}(1-r_1)^2r_2$            $\frac{1}{8}r_1(1-r_1)r_2$           $\frac{1}{8}r_1r_2(1-r_2)$           $\frac{1}{8}r_1(1-r_2)^2$
$B_1B_1$     $A_2A_2$     $\frac{1}{4}r_1(1-r_1)(1-r_2)^2$     $\frac{1}{4}(1-r_1)^2r_2(1-r_2)$     $\frac{1}{8}r_1r_2r$                 $\frac{1}{8}r_1r_2r$
$B_1B_1$     $B_2B_2$     $\frac{1}{4}r_1(1-r_1)r_2(1-r_2)$    $\frac{1}{4}(1-r_1)^2(1-r_2)^2$      $\frac{1}{8}r_1r_2(1-r)$             $\frac{1}{8}r_1r_2(1-r)$
$B_1B_1$     $C_2C_2$     $\frac{1}{8}r_1(1-r_1)r_2$           $\frac{1}{8}(1-r_1)^2r_2$            $\frac{1}{8}r_1(1-r_2)^2$            $\frac{1}{8}r_1r_2(1-r_2)$
$B_1B_1$     $D_2D_2$     $\frac{1}{8}r_1(1-r_1)r_2$           $\frac{1}{8}(1-r_1)^2r_2$            $\frac{1}{8}r_1r_2(1-r_2)$           $\frac{1}{8}r_1(1-r_2)^2$
$C_1C_1$     $A_2A_2$     $\frac{1}{8}r_1(1-r_2)^2$            $\frac{1}{8}r_1r_2(1-r_2)$           $\frac{1}{8}(1-r_1)^2r_2$            $\frac{1}{8}r_1(1-r_1)r_2$
$C_1C_1$     $B_2B_2$     $\frac{1}{8}r_1r_2(1-r_2)$           $\frac{1}{8}r_1(1-r_2)^2$            $\frac{1}{8}(1-r_1)^2r_2$            $\frac{1}{8}r_1(1-r_1)r_2$
$C_1C_1$     $C_2C_2$     $\frac{1}{8}r_1r_2(1-r)$             $\frac{1}{8}r_1r_2(1-r)$             $\frac{1}{4}(1-r_1)^2(1-r_2)^2$      $\frac{1}{4}r_1(1-r_1)r_2(1-r_2)$
$C_1C_1$     $D_2D_2$     $\frac{1}{8}r_1r_2r$                 $\frac{1}{8}r_1r_2r$                 $\frac{1}{4}(1-r_1)^2r_2(1-r_2)$     $\frac{1}{4}r_1(1-r_1)(1-r_2)^2$
$D_1D_1$     $A_2A_2$     $\frac{1}{8}r_1(1-r_2)^2$            $\frac{1}{8}r_1r_2(1-r_2)$           $\frac{1}{8}r_1(1-r_1)r_2$           $\frac{1}{8}(1-r_1)^2r_2$
$D_1D_1$     $B_2B_2$     $\frac{1}{8}r_1r_2(1-r_2)$           $\frac{1}{8}r_1(1-r_2)^2$            $\frac{1}{8}r_1(1-r_1)r_2$           $\frac{1}{8}(1-r_1)^2r_2$
$D_1D_1$     $C_2C_2$     $\frac{1}{8}r_1r_2r$                 $\frac{1}{8}r_1r_2r$                 $\frac{1}{4}r_1(1-r_1)(1-r_2)^2$     $\frac{1}{4}(1-r_1)^2r_2(1-r_2)$
$D_1D_1$     $D_2D_2$     $\frac{1}{8}r_1r_2(1-r)$             $\frac{1}{8}r_1r_2(1-r)$             $\frac{1}{4}r_1(1-r_1)r_2(1-r_2)$    $\frac{1}{4}(1-r_1)^2(1-r_2)^2$


Note: the three loci are denoted as 1, q and 2. Locus q is located between locus 1 and locus 2. $r_1$, $r_2$ and $r$ are the recombination frequencies
between locus 1 and locus q, between locus q and locus 2, and between locus 1 and locus 2, respectively, where $r = r_1 + r_2 - 2r_1r_2$, i.e., assuming
that the recombination events in two neighboring intervals are independent.


TAB. 8.16 — Theoretical frequencies of the 64 homozygous genotypes at three complete loci in the four-parental RIL population.


Genotype     Genotype     Genotype at locus q (located between locus 1 and locus 2)
at locus 1   at locus 2   $A_qA_q$                                     $B_qB_q$                                     $C_qC_q$                                     $D_qD_q$

$A_1A_1$     $A_2A_2$     $\frac{1}{4}(1-r_1)(1-r_2)(1-R_1)(1-R_2)$    $\frac{1}{4}r_1r_2(1-R_1)(1-R_2)$            $\frac{1}{8}(1-r)R_1R_2$                     $\frac{1}{8}(1-r)R_1R_2$
$A_1A_1$     $B_2B_2$     $\frac{1}{4}(1-r_1)r_2(1-R_1)(1-R_2)$        $\frac{1}{4}r_1(1-r_2)(1-R_1)(1-R_2)$        $\frac{1}{8}rR_1R_2$                         $\frac{1}{8}rR_1R_2$
$A_1A_1$     $C_2C_2$     $\frac{1}{8}(1-r_1)(1-R_1)R_2$               $\frac{1}{8}r_1(1-R_1)R_2$                   $\frac{1}{8}(1-r_2)R_1(1-R_2)$               $\frac{1}{8}r_2R_1(1-R_2)$
$A_1A_1$     $D_2D_2$     $\frac{1}{8}(1-r_1)(1-R_1)R_2$               $\frac{1}{8}r_1(1-R_1)R_2$                   $\frac{1}{8}r_2R_1(1-R_2)$                   $\frac{1}{8}(1-r_2)R_1(1-R_2)$
$B_1B_1$     $A_2A_2$     $\frac{1}{4}r_1(1-r_2)(1-R_1)(1-R_2)$        $\frac{1}{4}(1-r_1)r_2(1-R_1)(1-R_2)$        $\frac{1}{8}rR_1R_2$                         $\frac{1}{8}rR_1R_2$
$B_1B_1$     $B_2B_2$     $\frac{1}{4}r_1r_2(1-R_1)(1-R_2)$            $\frac{1}{4}(1-r_1)(1-r_2)(1-R_1)(1-R_2)$    $\frac{1}{8}(1-r)R_1R_2$                     $\frac{1}{8}(1-r)R_1R_2$
$B_1B_1$     $C_2C_2$     $\frac{1}{8}r_1(1-R_1)R_2$                   $\frac{1}{8}(1-r_1)(1-R_1)R_2$               $\frac{1}{8}(1-r_2)R_1(1-R_2)$               $\frac{1}{8}r_2R_1(1-R_2)$
$B_1B_1$     $D_2D_2$     $\frac{1}{8}r_1(1-R_1)R_2$                   $\frac{1}{8}(1-r_1)(1-R_1)R_2$               $\frac{1}{8}r_2R_1(1-R_2)$                   $\frac{1}{8}(1-r_2)R_1(1-R_2)$
$C_1C_1$     $A_2A_2$     $\frac{1}{8}(1-r_2)R_1(1-R_2)$               $\frac{1}{8}r_2R_1(1-R_2)$                   $\frac{1}{8}(1-r_1)(1-R_1)R_2$               $\frac{1}{8}r_1(1-R_1)R_2$
$C_1C_1$     $B_2B_2$     $\frac{1}{8}r_2R_1(1-R_2)$                   $\frac{1}{8}(1-r_2)R_1(1-R_2)$               $\frac{1}{8}(1-r_1)(1-R_1)R_2$               $\frac{1}{8}r_1(1-R_1)R_2$
$C_1C_1$     $C_2C_2$     $\frac{1}{8}(1-r)R_1R_2$                     $\frac{1}{8}(1-r)R_1R_2$                     $\frac{1}{4}(1-r_1)(1-r_2)(1-R_1)(1-R_2)$    $\frac{1}{4}r_1r_2(1-R_1)(1-R_2)$
$C_1C_1$     $D_2D_2$     $\frac{1}{8}rR_1R_2$                         $\frac{1}{8}rR_1R_2$                         $\frac{1}{4}(1-r_1)r_2(1-R_1)(1-R_2)$        $\frac{1}{4}r_1(1-r_2)(1-R_1)(1-R_2)$
$D_1D_1$     $A_2A_2$     $\frac{1}{8}(1-r_2)R_1(1-R_2)$               $\frac{1}{8}r_2R_1(1-R_2)$                   $\frac{1}{8}r_1(1-R_1)R_2$                   $\frac{1}{8}(1-r_1)(1-R_1)R_2$
$D_1D_1$     $B_2B_2$     $\frac{1}{8}r_2R_1(1-R_2)$                   $\frac{1}{8}(1-r_2)R_1(1-R_2)$               $\frac{1}{8}r_1(1-R_1)R_2$                   $\frac{1}{8}(1-r_1)(1-R_1)R_2$
$D_1D_1$     $C_2C_2$     $\frac{1}{8}rR_1R_2$                         $\frac{1}{8}rR_1R_2$                         $\frac{1}{4}r_1(1-r_2)(1-R_1)(1-R_2)$        $\frac{1}{4}(1-r_1)r_2(1-R_1)(1-R_2)$
$D_1D_1$     $D_2D_2$     $\frac{1}{8}(1-r)R_1R_2$                     $\frac{1}{8}(1-r)R_1R_2$                     $\frac{1}{4}r_1r_2(1-R_1)(1-R_2)$            $\frac{1}{4}(1-r_1)(1-r_2)(1-R_1)(1-R_2)$


Note: the three loci are denoted as 1, q and 2. Locus q is located between locus 1 and locus 2. $r_1$, $r_2$ and $r$ are the recombination frequencies between locus 1 and locus q, between locus q and locus 2, and
between locus 1 and locus 2, respectively. $r = r_1 + r_2 - 2r_1r_2$, i.e., assuming that the recombination events in two neighboring intervals are independent. $R_1$ and $R_2$ are the cumulated recombination
frequencies during repeated selfing between locus 1 and locus q, and between locus q and locus 2, respectively.

From the established linkage information, the incomplete and missing marker 
information can be imputed and then treated as completely informative, which will 
significantly simplify the QTL mapping methodology. For example, for one marker 
belonging to category AACD, codes A and B represent the same identifiable 
genotype. Pure lines coded by A or B are assigned to either A or B by certain rules. 
For category ABBB, codes B, C, and D represent the same identifiable genotype. 
Pure lines coded by B, C, or D are assigned to either B, C or D by certain rules. 
Code X representing the complete missing marker types is assigned to either A, B, 
C, or D by certain rules. After the imputation, all markers are converted to 
category ABCD without any incomplete and missing information. This imputation 
will greatly simplify the QTL mapping procedure, which will be seen in §8.3.3. 

Similar to the procedure introduced in §7.3.4, imputation of the incomplete and 
missing marker types is conducted by their orders on the constructed linkage map. 
Assume all markers before the current one to be imputed belong to category ABCD. 
Probabilities of the possible complete genotypes are calculated first, and then the 
incomplete and missing marker types are replaced with complete marker types by 
the ratio of probabilities. Imputation is conducted separately for pure lines in the 
progeny population, and the imputed results may be different for different pure lines 
even for the same incomplete or missing values. For example, for category AACD, 
two pure-line progenies both coded by A may be imputed as A and B, respectively; 
two pure-line progenies coded by B and A may be imputed as A and B, respectively. 
Three situations occur for the imputation algorithm, based on the number of 
fully-informative markers which are linked with the current one to be imputed. In 
each situation, completely missing value X, and two categories of incomplete 
markers, i.e., AACD and ABBB, are used as an example to show the probabilities 
of complete genotypes included in identifiable genotypes. Other incomplete 
categories of markers are left to the readers. The three identifiable genotypes of the 
category AACD marker are denoted as A + B, C, and D. The two identifiable 
genotypes of the category ABBB marker are denoted as A and B + C + D. 


1. No linkage information can be utilized 

As no linkage information can be utilized, missing value X has an equal 
probability of being A, B, C, or D; incomplete genotype A + B at the category 
AACD marker has an equal probability of being A or B; and incomplete 
genotype B + C + D at the category ABBB marker has an equal probability of being 
B, C, or D. Therefore, the following ratios of the included complete genotypes 
can be used in imputation. 


$$P\{A \mid X\} : P\{B \mid X\} : P\{C \mid X\} : P\{D \mid X\} = 1 : 1 : 1 : 1 \quad \text{(missing value X)}$$

$$P\{A \mid A+B\} : P\{B \mid A+B\} = 1 : 1 \quad \text{(category AACD)}$$

$$P\{B \mid B+C+D\} : P\{C \mid B+C+D\} : P\{D \mid B+C+D\} = 1 : 1 : 1 \quad \text{(category ABBB)}$$


Take a random number rd between 0 and 1. If rd < 0.25, X is replaced by 
A; if 0.25 ≤ rd < 0.5, X is replaced by B; if 0.5 ≤ rd < 0.75, X is replaced 
by C; otherwise, X is replaced by D. The imputation method is similar for 
A + B and B + C + D and is not described in detail. Of course, there is no 
need to impute genotypes C and D at the category AACD markers. 
The same is true for genotype A at the category ABBB markers. 
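
The random-number rule above can be sketched in a few lines of Python. This is only an illustration, not the GAPL implementation: the function name `impute_from_ratio` and the dictionary `NO_LINKAGE` are hypothetical, and the same sampler works for any probability ratio, not only the equal ratios of the no-linkage situation.

```python
import random

def impute_from_ratio(candidates, ratio, rng=random):
    """Return one complete genotype code, drawn with probability proportional to ratio."""
    total = float(sum(ratio))
    rd = rng.random()                 # random number in [0, 1)
    cumulative = 0.0
    for code, weight in zip(candidates, ratio):
        cumulative += weight / total
        if rd < cumulative:
            return code
    return candidates[-1]             # guard against floating-point round-off

# Equal ratios when no linkage information can be utilized (hypothetical lookup).
NO_LINKAGE = {
    "X": ("ABCD", (1, 1, 1, 1)),      # completely missing value
    "A+B": ("AB", (1, 1)),            # category AACD
    "B+C+D": ("BCD", (1, 1, 1)),      # category ABBB
}

rng = random.Random(2024)             # seeded only to make the draws repeatable
codes, ratio = NO_LINKAGE["X"]
draws = [impute_from_ratio(codes, ratio, rng) for _ in range(8)]
print(draws)
```

Each pure line is imputed by a separate draw, which is why two lines carrying the same incomplete code can receive different complete codes.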


2. One fully-informative linked marker can be utilized 

For the four-parental DH population, locus 2 in table 8.4 is viewed as the current 
locus to be imputed, and locus 1 is viewed as the fully-informative marker linked 
with the current marker. Obviously, the theoretical frequencies of the complete 
genotypes at locus 2 depend on the genotypes at locus 1. Take code A at locus 
1 as an example, i.e., genotype $A_1A_1$. For value X at locus 2, the four frequencies 
corresponding to genotype $A_1A_1$ as given in table 8.4 can be used to determine the 
probability ratio of the four complete genotypes. For incomplete value A + B at 
locus 2 with category AACD, the two frequencies corresponding to genotypes $A_2A_2$ and 
$B_2B_2$ as given in table 8.4 can be used to determine the probability ratio. For 
incomplete value B + C + D at locus 2 with category ABBB, the three frequencies 
corresponding to genotypes $B_2B_2$, $C_2C_2$, and $D_2D_2$ as given in table 8.4 can be used 
to determine the probability ratio. The three example probability ratios are given 
below. 


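$$P\{A \mid X\} : P\{B \mid X\} : P\{C \mid X\} : P\{D \mid X\} = 2(1-r)^2 : 2r(1-r) : r : r \quad \text{(value X)}$$

$$P\{A \mid A+B\} : P\{B \mid A+B\} = (1-r) : r \quad \text{(category AACD)}$$

$$P\{B \mid B+C+D\} : P\{C \mid B+C+D\} : P\{D \mid B+C+D\} = 2(1-r) : 1 : 1 \quad \text{(category ABBB)}$$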


For the four-parental RIL population, locus 2 in table 8.5 is viewed as the current 
locus to be imputed, and locus 1 in table 8.5 is viewed as the fully-informative marker 
linked with the current locus. Therefore, the theoretical frequencies given in table 8.5 
can be used to calculate the probability ratio of complete genotypes included in the 
missing or incomplete marker type in the four-parental RIL population. 
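
As a hedged illustration of the DH case, the sketch below builds the ratio 2(1 − r)² : 2r(1 − r) : r : r quoted above for code A at locus 1 and restricts it to an identifiable genotype. The helper names are hypothetical. The ratios for locus-1 codes B, C, and D are an assumption obtained by relabelling the parents of the double cross (A × B) × (C × D) and should be checked against table 8.4.

```python
def dh_ratio_one_linked(locus1_code, r):
    """Unnormalised weights of A, B, C, D at the locus to impute, given the linked locus."""
    partner = {"A": "B", "B": "A", "C": "D", "D": "C"}   # parents of the same single cross
    weights = {}
    for code in "ABCD":
        if code == locus1_code:
            weights[code] = 2 * (1 - r) ** 2    # same parental allele at both loci
        elif code == partner[locus1_code]:
            weights[code] = 2 * r * (1 - r)     # other parent of the same single cross
        else:
            weights[code] = r                   # parent from the other single cross
    return weights

def restrict(weights, identifiable):
    """Probabilities of the complete genotypes contained in one identifiable genotype."""
    total = sum(weights[code] for code in identifiable)
    return {code: weights[code] / total for code in identifiable}

weights = dh_ratio_one_linked("A", r=0.1)
print(restrict(weights, "AB"))    # category AACD, incomplete value A + B
print(restrict(weights, "BCD"))   # category ABBB, incomplete value B + C + D
```

With r = 0.1 the first call gives probabilities close to 0.9 and 0.1, i.e., the ratio (1 − r) : r. The resulting probabilities can then be used in place of the equal ratios when the random number is drawn.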


3. Two fully-informative flanking markers can be utilized 

The two fully-informative flanking markers are treated as locus 1 and locus 2, 
and the marker to be imputed is treated as locus q which is located between locus 1 
and locus 2 in the linkage map. Under the condition of each of the 16 complete 
genotypes at the two flanking complete loci, theoretical frequencies of the four 
genotypes at locus q (the locus to be imputed) can be acquired from the joint 
frequencies as given in table 8.15. Obviously, genotypic frequencies at locus q depend 
on the joint genotypes at locus 1 and locus 2. Take genotype $A_1A_1A_2A_2$ as an 
example: the four frequencies corresponding to genotype $A_1A_1A_2A_2$ as given in 
table 8.15 can be used to determine the probability ratio of possible complete 
genotypes included in missing or incomplete genotypes. Given below are three 
example probability ratios for value X at locus q, incomplete value A + B at locus q 
belonging to category AACD, and incomplete value B + C + D at locus q 
belonging to category ABBB. 


$$P\{A \mid X\} : P\{B \mid X\} : P\{C \mid X\} : P\{D \mid X\} = 2(1-r_1)^2(1-r_2)^2 : 2r_1(1-r_1)r_2(1-r_2) : r_1r_2(1-r) : r_1r_2(1-r) \quad \text{(value X)}$$

$$P\{A \mid A+B\} : P\{B \mid A+B\} = (1-r_1)(1-r_2) : r_1r_2 \quad \text{(category AACD)}$$

$$P\{B \mid B+C+D\} : P\{C \mid B+C+D\} : P\{D \mid B+C+D\} = 2(1-r_1)(1-r_2) : (1-r) : (1-r) \quad \text{(category ABBB)}$$


Similarly, theoretical frequencies given in table 8.16 can be used to calculate the 
probability ratios of possible complete genotypes included in missing or incomplete 
marker types in the four-parental RIL population. 
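
The flanking-marker ratios for the DH population can be evaluated numerically as a quick check. The sketch below covers only the example worked in the text, where both flanking markers carry code A. The function name is hypothetical, and the relationship r = r1 + r2 − 2·r1·r2 between the three recombination frequencies is assumed (independent recombination in the two neighboring intervals).

```python
def dh_ratio_flanked_by_A(r1, r2):
    """Unnormalised weights of A, B, C, D at locus q when both flanking markers are A."""
    r = r1 + r2 - 2 * r1 * r2          # assumed: no interference between intervals
    return {
        "A": 2 * (1 - r1) ** 2 * (1 - r2) ** 2,
        "B": 2 * r1 * (1 - r1) * r2 * (1 - r2),
        "C": r1 * r2 * (1 - r),
        "D": r1 * r2 * (1 - r),
    }

weights = dh_ratio_flanked_by_A(r1=0.1, r2=0.2)
total = sum(weights.values())
for code in "ABCD":
    print(code, round(weights[code] / total, 4))
```

With two closely linked flanking markers of the same parental type, nearly all of the probability falls on that parental type, so the imputation is much less random than in the first two situations.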


8.3.3 The Linear Regression Model of Phenotype 
on Marker Types 


Assume $A_q$, $B_q$, $C_q$, and $D_q$ are the alleles at one QTL carried by the four inbred 
parents. Genotypic values of the four QTL genotypes are given in equation 8.12, 
following the one-locus additive genetic model. 

$$\mu_k = \mu + a_k w_k, \qquad (8.12)$$

where k = 1–4 represents the four homozygous genotypes at the QTL; $\mu_k$ is the 
kth genotypic value at the QTL; $\mu$ is the overall mean of the four QTL genotypic 
values, or the mean of the progeny population; $a_k$ is the additive effect of the kth allele; 
and $w_k$ is an indicator of the QTL genotype, valued at 1 for the kth parental type and 0 
for the other parental types. 

On the other hand, if the four genotypic values are known, the overall mean and 
additive effects can be calculated by equation 8.13. 

$$\begin{aligned}
\mu &= \tfrac{1}{4}(\mu_1 + \mu_2 + \mu_3 + \mu_4), \\
a_1 &= \tfrac{1}{4}(3\mu_1 - \mu_2 - \mu_3 - \mu_4), \\
a_2 &= \tfrac{1}{4}(3\mu_2 - \mu_1 - \mu_3 - \mu_4), \\
a_3 &= \tfrac{1}{4}(3\mu_3 - \mu_1 - \mu_2 - \mu_4), \\
a_4 &= \tfrac{1}{4}(3\mu_4 - \mu_1 - \mu_2 - \mu_3)
\end{aligned} \qquad (8.13)$$

When there is no segregation distortion, genetic variation contributed by the 
QTL can be given by equation 8.14. 

$$V_G = \tfrac{1}{4}(a_1^2 + a_2^2 + a_3^2 + a_4^2) \qquad (8.14)$$
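
A small numerical illustration may help here. The sketch below computes the overall mean, the additive effects, and the genetic variance from a set of genotypic values, assuming equal genotype frequencies (no segregation distortion). The function name and the four genotypic values are made up for illustration. Each additive effect is computed as the genotypic value minus the mean, which is algebraically the same as the expressions in equation 8.13.

```python
def additive_effects(genotypic_values):
    """Return (overall mean, additive effects, genetic variance) for equally frequent genotypes."""
    k = len(genotypic_values)
    mu = sum(genotypic_values) / k
    effects = [value - mu for value in genotypic_values]   # a_k = mu_k - mu
    variance = sum(a * a for a in effects) / k             # mean of squared effects
    return mu, effects, variance

# Hypothetical genotypic values of the four homozygous QTL genotypes.
mu, effects, v_g = additive_effects([14.0, 12.0, 8.0, 6.0])
print(mu, effects, v_g)
```

For these values the mean is 10, the additive effects are 4, 2, −2, and −4, and the genetic variance is 10. The effects sum to zero by construction. Because the function works for any number of genotypes, it applies equally to populations with more than four parents.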
One restriction has to be made so as to estimate the five genetic parameters in 
equations 8.12 and 8.13, i.e., the sum of the four additive effects has to be equal to 0. 
To avoid the complexity caused by the restricted condition in parameter estimation, 
one orthogonal model is built, which is equivalent to equation 8.12 but without any 
restrictions. Assuming there are a number of m QTLs in the population, the 
genotypic value of the jth QTL is given in equation 8.15 by using the orthogonal 
model. 


$$G_j = \mu + b_{j1}u_j + b_{j2}v_j + b_{j3}u_jv_j \qquad (8.15)$$

where $u_j$ and $v_j$ are orthogonal indicators of the jth QTL genotypes, valued at 1 and 1 
for $A_qA_q$, 1 and −1 for $B_qB_q$, −1 and 1 for $C_qC_q$, and −1 and −1 for $D_qD_q$. The 
relationship of parameters in equation 8.15 with parameters in equation 8.12 is 
given in equation 8.16 (the readers can use exercise 8.11 to derive and prove the 
relationships). 

$$b_{j1} = \tfrac{1}{2}(a_1 + a_2), \quad b_{j2} = \tfrac{1}{2}(a_1 + a_3), \quad b_{j3} = \tfrac{1}{2}(a_1 + a_4) \qquad (8.16)$$
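
The equivalence of the restricted additive model and the orthogonal model can be checked numerically. In the sketch below, which uses hypothetical names and made-up effects, the three orthogonal coefficients are taken as half the sums a1 + a2, a1 + a3, and a1 + a4. These follow from the indicator coding (1, 1), (1, −1), (−1, 1), (−1, −1) together with the restriction that the four additive effects sum to zero.

```python
# Orthogonal indicators (u, v) of the four homozygous QTL genotypes.
INDICATORS = {"A": (1, 1), "B": (1, -1), "C": (-1, 1), "D": (-1, -1)}

def to_orthogonal(additive):
    """Map additive effects (a1..a4, summing to zero) to the coefficients (b1, b2, b3)."""
    a1, a2, a3, a4 = additive
    return ((a1 + a2) / 2, (a1 + a3) / 2, (a1 + a4) / 2)

def genotypic_value(mu, b, genotype):
    """Genotypic value under the orthogonal model: mu + b1*u + b2*v + b3*u*v."""
    u, v = INDICATORS[genotype]
    b1, b2, b3 = b
    return mu + b1 * u + b2 * v + b3 * u * v

additive = (4.0, 2.0, -2.0, -4.0)          # made-up effects that sum to zero
b = to_orthogonal(additive)
values = [genotypic_value(10.0, b, g) for g in "ABCD"]
print(b, values)
```

The four values returned equal the mean plus the corresponding additive effect, so the orthogonal model reproduces the additive model with three unrestricted coefficients instead of four restricted ones.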


Under the assumption of additivity on genotypic effects from different QTLs, the 
total genotypic value of the pure-line progeny can be given by equation 8.17. 


$$G = \mu + \sum_{j=1}^{m} (b_{j1}u_j + b_{j2}v_j + b_{j3}u_jv_j) \qquad (8.17)$$


Similar to ICIM as has been introduced in chapters 5 and 7, we can start from 
equation 8.15 and build the inclusive linear model between genotypic values of the 
jth QTL and the flanking markers, and then use equation 8.17 to build the inclusive 
linear model of phenotypic values of pure-line progenies depending on the 
genome-wide markers. Detailed information can be found in Zhang et al. (2017). 
Assume there are a number of m QTLs located on m intervals defined by m + 1 
markers on one chromosome. There is at most one QTL in one marker interval. For 
the intervals without QTLs, QTL effects are set at 0. Similar to the orthogonal 
indicators on QTL genotype as defined in equation 8.15, orthogonal indicators are 
also defined for markers, and the linear regression model of phenotypic values on 
markers can be derived and given in equation 8.18 for the pure-line progenies. 


$$P = \mu + \sum_{j=1}^{m+1} (\alpha_j x_j + \beta_j y_j + \tau_j x_j y_j) + \varepsilon \qquad (8.18)$$

where P is the phenotypic value of the pure-line progeny; $\varepsilon$ is a random error, 
assumed to be normally distributed; $\alpha_j$, $\beta_j$, and $\tau_j$ are the effects of the jth marker 
caused by QTL (to be estimated); $x_j$ and $y_j$ are orthogonal indicators of the jth 
marker, valued at 1 and 1 for the parent A marker type, 1 and −1 for the parent B 
marker type, −1 and 1 for the parent C marker type, and −1 and −1 for the parent D 
marker type. 

It should be noted that the QTL effects can also cause interactions between 
flanking markers, similar to the epistatic effects between markers caused by the 
dominant effect of QTL in bi-parental $F_2$ populations (see chapter 5), and the 
interaction between two markers caused by the dominant effect in the double cross $F_1$ 
population (see chapter 7). However, to reduce the number of parameters in the 
regression model, the interaction effects between markers are not considered in 
equation 8.18. Strictly speaking, equation 8.18 is not an inclusive linear model of 
the phenotype on genome-wide markers. When the recombination frequencies 
between QTL and the linked markers are not equal to 0, there may be part of 
QTL variation which cannot be fitted in the linear model of equation 8.18. The 
unfitted genetic variation will be added to the random error in estimation and 
testing. 


8.3.4 Inclusive Composite Interval Mapping (ICIM) 
in Four-Parental Pure-Line Populations 


Similar to the bi-parental population (chapters 5 and 6; Li et al., 2007; Zhang et al., 
2008) and the four-parental double cross $F_1$ population (chapter 7; Zhang et al., 2015b), 
the ICIM method as applied in the four-parental pure-line population also has two steps. 
Firstly, by considering all marker information, equation 8.18 is used to select the 
most significant variables. The coefficients of those variables not retained by the 
stepwise regression are set at 0. Secondly, phenotypic values are adjusted by the 
significant markers selected by the stepwise regression, i.e., 

$$\Delta P_i = P_i - \sum_{j \neq k,\, k+1} (\hat{\alpha}_j x_{ij} + \hat{\beta}_j y_{ij} + \hat{\tau}_j x_{ij} y_{ij}) \qquad (8.19)$$


where $P_i$ is the phenotypic value of the ith pure-line progeny (i = 1, 2, ..., n, and n is 
the population size); k and k + 1 represent the two flanking markers of the current 
scanning position; the hat symbol represents the estimated value of a parameter; $x_{ij}$ 
and $y_{ij}$ are orthogonal indicators of the genotype of the ith progeny at the jth marker. 
The adjusted phenotypic value $\Delta P_i$ contains information on the position and effects 
of the QTL in the current interval, while most of the variation from QTLs outside 
the current interval has been excluded by the adjustment. 
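
The adjustment step can be sketched as follows. This is a minimal illustration with hypothetical names, not the GAPL code. It assumes the stepwise regression has already produced one coefficient for each marker term, with zeros for markers that were not selected, and it uses a 0-based index k for the left flanking marker.

```python
def adjust_phenotypes(phenotypes, x, y, coef_x, coef_y, coef_xy, k):
    """Subtract the fitted effects of all markers except the two flanking ones (k, k + 1).

    x[i][j] and y[i][j] are the orthogonal indicators of line i at marker j;
    coef_x, coef_y, coef_xy are the estimated coefficients of each marker.
    """
    adjusted = []
    for i, p in enumerate(phenotypes):
        background = sum(
            coef_x[j] * x[i][j] + coef_y[j] * y[i][j] + coef_xy[j] * x[i][j] * y[i][j]
            for j in range(len(coef_x))
            if j not in (k, k + 1)        # keep the current interval untouched
        )
        adjusted.append(p - background)
    return adjusted

# One line, three markers, made-up indicators and coefficients.
x = [[1, -1, 1]]
y = [[1, 1, -1]]
print(adjust_phenotypes([10.0], x, y, [1.0, 2.0, 3.0], [0.5, 0.0, 0.0], [0.0, 0.0, 1.0], k=0))
```

In the usage line, only the third marker lies outside the current interval, so its fitted contribution (3 − 1 = 2) is removed and the adjusted value is 8. The adjustment has to be recomputed whenever the scanning position moves into a new marker interval.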

Interval mapping is conducted on the adjusted phenotypic values. Phenotypes of 
the four homozygous QTL genotypes follow normal distributions with different 
means but the same variance, i.e., $N(\mu_k, \sigma^2)$, k = 1, 2, 3, 4. The null and alternative 
hypotheses used to test the existence of a QTL are, 

$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 \quad \text{and} \quad H_A: \text{at least two of } \mu_1, \mu_2, \mu_3, \text{ and } \mu_4 \text{ are not equal}$$

The logarithm likelihood under $H_A$ is, 

$$\ln L_A = \sum_{j=1}^{16} \sum_{i \in S_j} \ln \left[ \sum_{k=1}^{4} \pi_{jk} f(\Delta P_i; \mu_k, \sigma^2) \right] \qquad (8.20)$$

where $S_j$ represents the individuals included in the jth marker class on two flanking 
markers (j = 1, 2, ..., 16, corresponding to the 16 genotypes given in table 8.15 
or table 8.16); $\pi_{jk}$ (k = 1, 2, 3, 4) is the proportion of the kth QTL genotype in the 
jth marker class, i.e., the four theoretical frequencies in the jth column in table 8.15 
or table 8.16 divided by their sum; and $f(\cdot\,; \mu_k, \sigma^2)$ represents the 
density function of the normal distribution $N(\mu_k, \sigma^2)$. 

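The EM algorithm is used for the maximum likelihood estimation of parameters 
in equation 8.20. Most pure lines in marker classes 1, 6, 11, and 16 have QTL 
genotypes $A_qA_q$, $B_qB_q$, $C_qC_q$, and $D_qD_q$, respectively. Hence, initial values of the 
parameters to start the EM algorithm can be defined as follows. 

$$\mu_1^{(0)} = \frac{1}{n_1} \sum_{i=1}^{n_1} \Delta P_i, \quad \mu_2^{(0)} = \frac{1}{n_6} \sum_{i=n_{1:5}+1}^{n_{1:6}} \Delta P_i, \quad \mu_3^{(0)} = \frac{1}{n_{11}} \sum_{i=n_{1:10}+1}^{n_{1:11}} \Delta P_i, \quad \mu_4^{(0)} = \frac{1}{n_{16}} \sum_{i=n_{1:15}+1}^{n_{1:16}} \Delta P_i$$

$$\sigma^{2(0)} = \frac{1}{n_1 + n_6 + n_{11} + n_{16}} \left[ \sum_{i=1}^{n_1} (\Delta P_i - \mu_1^{(0)})^2 + \sum_{i=n_{1:5}+1}^{n_{1:6}} (\Delta P_i - \mu_2^{(0)})^2 + \sum_{i=n_{1:10}+1}^{n_{1:11}} (\Delta P_i - \mu_3^{(0)})^2 + \sum_{i=n_{1:15}+1}^{n_{1:16}} (\Delta P_i - \mu_4^{(0)})^2 \right]$$

where $n_j$ is the number of pure lines in the jth marker class, with the pure lines ordered by marker class, and $n_{i:j}$ represents the summation from $n_i$ to $n_j$, for example, $n_{1:5} = n_1 + n_2 + \cdots + n_5$. 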


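In the E-step, the posterior probability of the ith pure line belonging to the kth 
QTL genotype is given by, 

$$w_{ik}^{(0)} = \frac{\pi_{jk} f(\Delta P_i; \mu_k^{(0)}, \sigma^{2(0)})}{\sum_{l=1}^{4} \pi_{jl} f(\Delta P_i; \mu_l^{(0)}, \sigma^{2(0)})}, \quad i \in S_j$$

In the M-step, the unknown parameters in the log-likelihood function are 
updated by, 

$$\mu_k^{(1)} = \frac{\sum_{i=1}^{n} w_{ik}^{(0)} \Delta P_i}{\sum_{i=1}^{n} w_{ik}^{(0)}}, \quad \sigma^{2(1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{4} w_{ik}^{(0)} (\Delta P_i - \mu_k^{(1)})^2$$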


Under $H_0$, genotypic values of the four QTL genotypes follow the same normal 
distribution. It is easy to acquire the maximum likelihood estimates of parameters, i.e., 

$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} \Delta P_i, \quad \hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (\Delta P_i - \hat{\mu}_0)^2$$

The LRT statistic and LOD score are calculated from the maximum likelihood 
functions under the two hypotheses and then used to test the existence of QTL. 
Values of parameters corresponding to the maximum likelihood function under the 
alternative hypothesis are the maximum likelihood estimates of the parameters to be 
estimated. 
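
The E-step, M-step, and LOD calculation at a single scanning position can be sketched as below. This is a simplified illustration rather than the GAPL implementation. All function names are hypothetical, the prior proportions of the four QTL genotypes are passed in per line (they would come from the flanking-marker class of each line), and the data in the usage lines are made up. The LOD score is taken as the difference between the two maximized log-likelihoods divided by ln 10.

```python
import math

def normal_pdf(value, mean, variance):
    """Density of the normal distribution N(mean, variance) at value."""
    return math.exp(-((value - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def log_likelihood(dp, prior, mu, var):
    """Log-likelihood under HA: each line is a mixture over the four QTL genotypes."""
    return sum(
        math.log(sum(p * normal_pdf(v, m, var) for p, m in zip(prior_i, mu)))
        for v, prior_i in zip(dp, prior)
    )

def em_fit(dp, prior, mu, var, max_iter=500, tol=1e-8):
    """Alternate E- and M-steps until the log-likelihood gain falls below tol."""
    mu = list(mu)
    ln_l = log_likelihood(dp, prior, mu, var)
    for _ in range(max_iter):
        # E-step: posterior probability of each QTL genotype for each line
        w = []
        for v, prior_i in zip(dp, prior):
            joint = [p * normal_pdf(v, m, var) for p, m in zip(prior_i, mu)]
            total = sum(joint)
            w.append([x / total for x in joint])
        # M-step: weighted means first, then the common variance
        mu = [
            sum(w_i[k] * v for w_i, v in zip(w, dp)) / sum(w_i[k] for w_i in w)
            for k in range(len(mu))
        ]
        var = sum(
            w_i[k] * (v - mu[k]) ** 2 for w_i, v in zip(w, dp) for k in range(len(mu))
        ) / len(dp)
        new_ln_l = log_likelihood(dp, prior, mu, var)
        converged = new_ln_l - ln_l < tol
        ln_l = new_ln_l
        if converged:
            break
    return mu, var, ln_l

def lod_score(dp, prior, mu_start, var_start):
    """LOD = (ln L_A - ln L_0) / ln 10, with closed-form estimates under H0."""
    n = len(dp)
    mean0 = sum(dp) / n
    var0 = sum((v - mean0) ** 2 for v in dp) / n
    ln_l0 = sum(math.log(normal_pdf(v, mean0, var0)) for v in dp)
    _, _, ln_la = em_fit(dp, prior, mu_start, var_start)
    return (ln_la - ln_l0) / math.log(10)

# Made-up adjusted phenotypes of 12 lines in four well-separated marker classes.
dp = [10.1, 9.9, 10.0, 5.1, 4.9, 5.0, 0.1, -0.1, 0.0, -5.0, -5.1, -4.9]
prior = ([[0.97, 0.01, 0.01, 0.01]] * 3 + [[0.01, 0.97, 0.01, 0.01]] * 3
         + [[0.01, 0.01, 0.97, 0.01]] * 3 + [[0.01, 0.01, 0.01, 0.97]] * 3)
mu_hat, var_hat, _ = em_fit(dp, prior, [10.0, 5.0, 0.0, -5.0], 1.0)
print([round(m, 2) for m in mu_hat], round(var_hat, 4), round(lod_score(dp, prior, [10.0, 5.0, 0.0, -5.0], 1.0), 1))
```

The sketch omits safeguards that a production scan would need, such as protection against a zero posterior weight for a genotype or numerical underflow of the densities.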

In addition to ICIM, the GAPL software also implements single marker analysis 
(SMA) and simple interval mapping (IM) for the four-parental DH and RIL 
populations. Figure 8.4 shows the LOD score histogram from SMA, and the LOD score 
profiles from IM and ICIM, in one simulated four-parental population consisting of 200 
pure lines. It can be seen that for the three methods, there are markers or positions 
with significantly high LOD scores on the first six chromosomes. If the LOD score 
threshold is set at 4.0, there are QTLs affecting the phenotypic trait on these 
chromosomes. LOD scores from IM (figure 8.4B) and SMA (figure 8.4A) are similar. If 
the data points in figure 8.4A were connected to make a line, the line would be similar 
to the LOD profile of IM. If peaks on the LOD profile higher than the LOD threshold 
are viewed as QTLs, mapping results from the two methods are actually the same. 
Neither method has background control in QTL detection. If the linkage map is 
dense enough, the two methods are close to being equivalent. 

In the simulated population shown in figure 8.4, there is one QTL on each of the 
first six chromosomes and no QTL on the last two chromosomes. Except for the 
peak at the end of chromosome 4, ICIM shows peaks in its LOD score profile around 
the true positions of the pre-defined QTLs and has low LOD scores at positions far from 
the pre-defined QTLs and on chromosomes without any QTLs. Meanwhile, 
around the QTL positions, LOD scores from ICIM are much higher than those from 
IM; away from the pre-defined QTLs and on chromosomes without QTLs, LOD 
scores from ICIM are much lower than those from IM. Therefore, the background 
control by adjusting the phenotypic values (equation 8.19) not only improves the 
QTL detection power but also reduces the false discovery rate in QTL mapping 
studies on multi-parental mapping populations. 

[Figure 8.4. Panel A: single marker analysis, LOD score against markers ordered by chromosome number (1 to 8). Panel B: simple interval mapping and inclusive composite interval mapping, LOD score against scanning positions ordered by chromosome number.] 

Fig. 8.4 – Bar graph of LOD score from single marker analysis (SMA) (A) and profiles of 
LOD score from simple interval mapping (IM) and inclusive composite interval mapping 
(ICIM) (B) in a simulated four-parental DH population. 

Table 8.17 provides the positions, effects, and confidence intervals 
of the QTLs detected by ICIM. The LOD threshold is set at 3.96, as determined from 
1000 permutation tests. The left and right boundaries of the confidence interval 
are the two positions at which the LOD score drops by one from the peak 
of the detected QTL. The QTL located at 123 cM at the end of chromosome 4 is far from 
the pre-defined QTL and is treated as a false QTL. The other six QTLs are close 
to the six pre-defined QTLs. There is a positive correlation between the LOD score and 
the PVE of the detected QTL. The PVE of the QTL at 48 cM on chromosome 4 is the highest, and so 
is the LOD score at the QTL position. PVEs of the QTLs on chromosomes 2 and 5 are 
similar, and so are their LOD scores. From estimates of the additive effects, parent A 
has the alleles with the largest additive effects at the six detected QTLs. If a higher 
phenotype is favored, parent A carries the favorable alleles at all six detected 
QTLs. 
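
The one-LOD rule for the confidence interval can be sketched on a grid of scanning positions. The function below is a hypothetical helper, not the GAPL routine. It handles one peak on one profile and returns the outermost scanned positions whose LOD score is still within one unit of the peak, without interpolating between grid points.

```python
def one_lod_interval(positions, lod, drop=1.0):
    """Return (left, peak, right) positions of the one-LOD support interval of the highest peak."""
    peak = max(range(len(lod)), key=lod.__getitem__)
    threshold = lod[peak] - drop
    left = peak
    while left > 0 and lod[left - 1] >= threshold:
        left -= 1                     # extend leftwards while LOD stays above the threshold
    right = peak
    while right < len(lod) - 1 and lod[right + 1] >= threshold:
        right += 1                    # extend rightwards in the same way
    return positions[left], positions[peak], positions[right]

# Made-up LOD profile on a 1 cM grid.
print(one_lod_interval([0, 1, 2, 3, 4, 5], [1.0, 4.5, 5.0, 4.2, 3.9, 1.0]))
```

For the made-up profile the peak is at position 2 and the interval runs from position 1 to position 3. A chromosome with several QTLs would need the profile to be split around each peak before applying the function.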


Tab. 8.17 – Mapping results from ICIM in a simulated four-parental DH population. 

Chr.  Pos.   LOD     PVE     Additive effect of alleles            Confidence interval
      (cM)   score   (%)     A (a₁)   B (a₂)   C (a₃)   D (a₄)     Left (cM)   Right (cM)
1      18    17.23   14.37    4.64     2.73    −2.61    −4.77        14.5        19.5
2      58    13.84   10.90    4.94     1.09    −2.28    −3.75        53.5        59.5
3      24     8.99    6.67    3.19     1.46    −1.12    −3.54        16.5        29.5
4      48    18.25   14.76    5.28     1.53    −3.23    −3.57        44.5        49.5
4     123     5.63    4.55    2.23    −2.83     1.96    −1.35       120.5       127.5
5      36    12.36   10.04    4.43     1.26    −0.97    −4.72        30.5        39.5
6      54    11.24    8.76    3.58     1.75    −1.61    −3.72        49.5        57.5


8.4 QTL Mapping in Eight-Parental Pure-Line 
Populations 

The QTL mapping method in the eight-parental DH and RIL populations is briefly 
introduced in this section. The readers can refer to Shi et al. (2019) for more details. 


8.4.1 Genetic Constitution at Three Complete Loci 


Similar to the previous section, theoretical frequencies of the joint genotypes at three 
complete loci are introduced first, which can be used in the imputation of incomplete 
and missing marker information. To be consistent with the following QTL mapping, 
one complete locus is viewed as the QTL located between two complete markers. 
Considering the three linked complete markers at the same time, there are a total of 
$8^3 = 512$ homozygous genotypes in the eight-parental pure-line populations. To 
obtain the theoretical frequencies of homozygous genotypes in the pure-line 
progenies, a generation transition matrix similar to the one given in table 8.3 has to be 
defined. The transition matrix gives the frequencies of the 512 homozygous genotypes 
generated from each genotype included in the eight-way cross $F_1$ population 
(figure 8.3). 

The eight-way cross $F_1$ population is generated by the random union of gametes 
from the ABCD double cross $F_1$ and gametes from the EFGH double cross $F_1$. 
Considering three loci jointly, theoretical frequencies of the 64 gametes generated by 
both double crosses are exactly the same as the frequencies of the 64 homozygous 
genotypes in the four-parental DH population, as given in table 8.15. 
Therefore, theoretical frequencies of the $64^2 = 4096$ genotypes included in the 
eight-way cross $F_1$ population can be acquired from the frequencies given in 
table 8.15. For example, the frequency of gamete $A_1A_qA_2$ generated by the ABCD 
double cross $F_1$ is equal to $\frac{1}{4}(1-r_1)^2(1-r_2)^2$, and the frequency of gamete $E_1E_qE_2$ 
generated by the EFGH double cross $F_1$ is equal to $\frac{1}{4}(1-r_1)^2(1-r_2)^2$. Therefore, 
the theoretical frequency of genotype $A_1A_qA_2/E_1E_qE_2$ in the eight-way cross $F_1$ 
population is equal to the product of the two gamete frequencies, i.e., 
$\frac{1}{16}(1-r_1)^4(1-r_2)^4$. The DH and RIL progenies generated by this genotype contain 
eight possible homozygous genotypes. If alleles $E_1$, $E_q$, and $E_2$ are viewed as alleles $C_1$, 
$C_q$, and $C_2$ in table 8.14, respectively, genotypic frequencies as given in table 8.14 
are also the genotypic frequencies of DH and RIL progenies generated by genotype 
$A_1A_qA_2/E_1E_qE_2$. Therefore, the vector corresponding to this genotype in the 
transition matrix can be defined. 

Using the generation transition matrix from the eight-way cross $F_1$ to DH lines at 
three loci, the theoretical frequencies of the 512 homozygous genotypes in the 
eight-parental DH population can be obtained. Table 8.18 provides the theoretical 
frequencies only for the progeny genotypes in which the alleles at locus 1 are $A_1$ and 
$B_1$. In the frequencies given in the table, $r_1$, $r_2$, and $r$ are the recombination frequencies 
between the left marker (i.e., locus 1) and locus q, between locus q and the right 
marker (i.e., locus 2), and between locus 1 and locus 2, respectively. The three 
recombination frequencies have the relationship $r = r_1 + r_2 - 2r_1r_2$, by assuming 
that the recombination events in two neighboring intervals are independent. 

Using the generation transition matrix from the eight-way cross $F_1$ to RILs at 
three loci, the theoretical frequencies of the 512 homozygous genotypes in the 
eight-parental RIL population can be obtained. Table 8.19 provides the theoretical 
frequencies only for the progeny genotypes in which the alleles at locus 1 are $A_1$ and $B_1$. In 
the frequencies given in the table, $r_1$, $r_2$, and $r$ are the recombination frequencies 
between the left marker (i.e., locus 1) and locus q, between locus q and the right 
marker (i.e., locus 2), and between locus 1 and locus 2, respectively. The three 
recombination frequencies have the relationship $r = r_1 + r_2 - 2r_1r_2$, by assuming 
that the recombination events in two neighboring intervals are independent. R is the 

Tab. 8.18 – Theoretical frequencies of partial homozygous genotypes at three complete loci in the eight-parental DH population. 

Genotype  Genotype  Genotype at locus q (located between locus 1 and locus 2)
at locus 1 at locus 2
                    AqAq                       BqBq                       CqCq                       DqDq
A₁A₁      A₂A₂      (1−r₁)³(1−r₂)³/8           r₁(1−r₁)²r₂(1−r₂)²/8       r₁(1−r₁)r₂(1−r₂)(1−r)/16   r₁(1−r₁)r₂(1−r₂)(1−r)/16
A₁A₁      B₂B₂      (1−r₁)³r₂(1−r₂)²/8         r₁(1−r₁)²(1−r₂)³/8         r₁(1−r₁)r₂(1−r₂)r/16       r₁(1−r₁)r₂(1−r₂)r/16
A₁A₁      C₂C₂      (1−r₁)³r₂(1−r₂)/16         r₁(1−r₁)²r₂(1−r₂)/16       r₁(1−r₁)(1−r₂)³/16         r₁(1−r₁)r₂(1−r₂)²/16
A₁A₁      D₂D₂      (1−r₁)³r₂(1−r₂)/16         r₁(1−r₁)²r₂(1−r₂)/16       r₁(1−r₁)r₂(1−r₂)²/16       r₁(1−r₁)(1−r₂)³/16
A₁A₁      E₂E₂      (1−r₁)³r₂/32               r₁(1−r₁)²r₂/32             r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
A₁A₁      F₂F₂      (1−r₁)³r₂/32               r₁(1−r₁)²r₂/32             r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
A₁A₁      G₂G₂      (1−r₁)³r₂/32               r₁(1−r₁)²r₂/32             r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
A₁A₁      H₂H₂      (1−r₁)³r₂/32               r₁(1−r₁)²r₂/32             r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
B₁B₁      A₂A₂      r₁(1−r₁)²(1−r₂)³/8         (1−r₁)³r₂(1−r₂)²/8         r₁(1−r₁)r₂(1−r₂)r/16       r₁(1−r₁)r₂(1−r₂)r/16
B₁B₁      B₂B₂      r₁(1−r₁)²r₂(1−r₂)²/8       (1−r₁)³(1−r₂)³/8           r₁(1−r₁)r₂(1−r₂)(1−r)/16   r₁(1−r₁)r₂(1−r₂)(1−r)/16
B₁B₁      C₂C₂      r₁(1−r₁)²r₂(1−r₂)/16       (1−r₁)³r₂(1−r₂)/16         r₁(1−r₁)(1−r₂)³/16         r₁(1−r₁)r₂(1−r₂)²/16
B₁B₁      D₂D₂      r₁(1−r₁)²r₂(1−r₂)/16       (1−r₁)³r₂(1−r₂)/16         r₁(1−r₁)r₂(1−r₂)²/16       r₁(1−r₁)(1−r₂)³/16
B₁B₁      E₂E₂      r₁(1−r₁)²r₂/32             (1−r₁)³r₂/32               r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
B₁B₁      F₂F₂      r₁(1−r₁)²r₂/32             (1−r₁)³r₂/32               r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
B₁B₁      G₂G₂      r₁(1−r₁)²r₂/32             (1−r₁)³r₂/32               r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64
B₁B₁      H₂H₂      r₁(1−r₁)²r₂/32             (1−r₁)³r₂/32               r₁(1−r₁)r₂/64              r₁(1−r₁)r₂/64

Tab. 8.18 – (continued). 

Genotype  Genotype  Genotype at locus q (located between locus 1 and locus 2)
at locus 1 at locus 2
                    EqEq                FqFq                GqGq                HqHq
A₁A₁      A₂A₂      r₁r₂(1−r)²/32       r₁r₂(1−r)²/32       r₁r₂(1−r)²/32       r₁r₂(1−r)²/32
A₁A₁      B₂B₂      r₁r₂r(1−r)/32       r₁r₂r(1−r)/32       r₁r₂r(1−r)/32       r₁r₂r(1−r)/32
A₁A₁      C₂C₂      r₁r₂r/64            r₁r₂r/64            r₁r₂r/64            r₁r₂r/64
A₁A₁      D₂D₂      r₁r₂r/64            r₁r₂r/64            r₁r₂r/64            r₁r₂r/64
A₁A₁      E₂E₂      r₁(1−r₂)³/32        r₁r₂(1−r₂)²/32      r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64
A₁A₁      F₂F₂      r₁r₂(1−r₂)²/32      r₁(1−r₂)³/32        r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64
A₁A₁      G₂G₂      r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64       r₁(1−r₂)³/32        r₁r₂(1−r₂)²/32
A₁A₁      H₂H₂      r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64       r₁r₂(1−r₂)²/32      r₁(1−r₂)³/32
B₁B₁      A₂A₂      r₁r₂r(1−r)/32       r₁r₂r(1−r)/32       r₁r₂r(1−r)/32       r₁r₂r(1−r)/32
B₁B₁      B₂B₂      r₁r₂(1−r)²/32       r₁r₂(1−r)²/32       r₁r₂(1−r)²/32       r₁r₂(1−r)²/32
B₁B₁      C₂C₂      r₁r₂r/64            r₁r₂r/64            r₁r₂r/64            r₁r₂r/64
B₁B₁      D₂D₂      r₁r₂r/64            r₁r₂r/64            r₁r₂r/64            r₁r₂r/64
B₁B₁      E₂E₂      r₁(1−r₂)³/32        r₁r₂(1−r₂)²/32      r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64
B₁B₁      F₂F₂      r₁r₂(1−r₂)²/32      r₁(1−r₂)³/32        r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64
B₁B₁      G₂G₂      r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64       r₁(1−r₂)³/32        r₁r₂(1−r₂)²/32
B₁B₁      H₂H₂      r₁r₂(1−r₂)/64       r₁r₂(1−r₂)/64       r₁r₂(1−r₂)²/32      r₁(1−r₂)³/32

Note: the three loci are denoted as 1, q, and 2; locus q is located between locus 1 and locus 2; r₁, r₂, and r are the recombination frequencies between locus 
1 and locus q, between locus q and locus 2, and between locus 1 and locus 2, respectively, where r = r₁ + r₂ − 2r₁r₂, i.e., assuming that the 
recombination events in two neighboring intervals are independent. 




TAB. 8.19 — Theoretical frequencies of partial genotypes at three complete loci in the eight-parental RIL population. 


Genotype Genotype 
at locus 1 at locus 2 


AA AAs 
AA GC 
AA DoD» 
AA GG 
AA HH 
BB AAs 
BB ByBo 
BB Ce, 
BB DD» 
BB BE; 
BB FF» 
BB Gs 
BB HH, 


Genotype at locus q (between locus 1 and locus 2) 


yü — rA- R) 


1 — Hə)/8 


)rə(1 — r)(1 — Rı) 


1 — R)/8 
1)?ra(1 - Bi)(1 - Ra)/ 


1) (1 R) - Re)/ 


)?(1 - Rı)R2/32 
)?(1 - Rı)R2/32 
)?(1 — Rı)R2/32 
Pa Ree an 


(l= m) (1- Ri) 


Ma -R i 


.. — m a = R) 


C, C, 
nirə(l — r)(1- Ri) 
(1 - Bə)/16 


nirər(1 — Ry)(1 — Rə)/ 
16 
r(1 — rə)”(1 — Ry) 


(1 = Rə)/16 
ryme(1 — r)(1 — R) 
(1 — Rə)/16 


n(1 — Rı)R2/64 

r(1 — R,) R2/64 

n(1 — Ry) Ro/64 
rı(1 — Ry) R2/64 

ryrer(1 — Ry)(1 — Rə)/ 

16 

nil — 11 - Ry) 


(1 — Rə)/16 

n(1 — rə)”(1 - RU) 
(1 — R,)/16 

ryre(1 — r)(1 — Ri) 
(1 — Ry) /16 

rı(1 — Ry) R2/64 
n(1 — Ry) Ro/64 
rı(1 — Ry) R2/64 
rı(1 — Ry) Ro/64 


D,D. 
rıra(1 — r)(1 - Ri) 
(1 - Re) /16 


ryrer(1 — Ry)(1 — Rə)/ 
16 
nirə(l — ra)(1 — R) 


(1 = Rz)/16 
r(1 — rn) (1- Ry) 
(1 — Re) /16 


rı(1 — Ry) R2/64 
r(1 — Ry) R2/64 
rı(1 — Ru) R2/64 
rı(1 — Ry) hə/64 
ryrer(1 — Ry)(1 — Rə)/ 
16 

ryro(1 — r)(1 -= hi) 


(1 — Re) /16 

rre(1 — rə)(1 — Ry) 
(1 = R)/16 

n(1 — rə)”(1 — Ri) 
(1 — Ry) /16 

rı(1 — Ry) Rə/64 
n(1 — Rı)Rə/64 
n(1 — Ry) R2/64 
m(1 — Ry) Ro/64 
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Genotype 
at locus 2 


Tas. 8.19 — (continued). 


Genotype at locus q (between locus 1 and locus 2) 


EE, 


FaF, 


GaG, 


AH. 


—  “—— “ “ —““ xa aa a 0. 


Genotype 

at locus 1 

ALA A.A, 
ALA Bo Bo 
ALA Cə C2 
ALA D.D, 
ALA G2 G2 
AA Ay Hy 
BB Ao Ag 
BB Bo Bo 
B, B. Cə Cə 
B.B D.D, 
B.B Ey Eo 
B.B PoP 
B, B Go Gp 
BB Ay Hy 


(1 — r)?R,R2/32 

r(1 — r) Ry Rə/32 

rR, R2/64 

rR, R2/64 

he n) Ry (1 — Ry) /32 
ra(1 — r2)Rı(1 — Rə)/32 


r2Rı(1 — Ro) /64 


r2Rı(1 — Ro) /64 


r(1 — r) Ry Rə/32 

(1 — r} R R2/32 

rR, R2/64 

rR, R2/64 

(1 — r) Rı(1 — Ry) /32 
ro(1 — r2)Rı(1 — Rə)/32 


rR, (1 — Ry) /64 


r2Rı(1 — Ro) /64 


(1 — r)?R, R2/32 

r(1 — r) RiRə/32 
rRiRə/64 

rRiRə/64 

rə(1 — ry) Ri(1 — Rə)/32 
(1 = rp)? Ry(1 — Rə)/32 


mR (1 = Ry) /64 


mR (1 sö Ry) /64 


r(1 — r) RiRə/32 

(1 — r)?R, R2/32 
rRiRə/64 

rRiRə/64 

rə(1 — rm») Ry(1 — R2)/32 
(1 = rp)? Ry(1 — Rə)/32 


Ri (1 — Ry) /64 


mR g“ Ry) /64 


(1 — r)?R, R2/32 
r(1 — r) RiRə/32 
rR, R2/64 
rR, R2/64 
ry Ry(1 — Ry) /64 
ry Ry(1 — Ry) /64 


(1 - r)?Ri(1 — Ro) /32 


Tə(l — rm) Ri(1 — Rə)/ 
32 

r(1 — r) RiRə/32 

(1 — r)?R, R2/32 

rR, R2/64 

rR, R2/64 

r»R,(1 — Rə) /64 
rəR,(1 — Rə)/64 


düz rə)” Ry(1 — Rə)/32 


rə(l — mr) Ri(1 — Rə)/ 
32 


(1 — r)?R, R2/32 

r(1 — r) Ry Re/32 

rR, R2/64 

rR, R2/64 

rəRı(1 — Ry) /64 
rəRı(1 — Ry) /64 

n(1 — 7) Ri(1 = hə)/ 
32 


(1 — m) Ri(1 — Ry) /32 


r(1 — r) Ry Rp/32 

(1 — r)’ Rı R2/32 

rR, R2/64 

rR, R2/64 

rəRy(1 — Ro) /64 
rəR,(1 — Ry) /64 

(1 — r2)Rı(1 — Rə)/ 
32 


(1 = r)?Ri(1 — Ro) /32 


Note: the three loci are denoted as 1, q and 2; locus q is located between locus 1 and locus 2; ri, rə, and r are the recombination frequencies 


between locus 1 and locus q, between locus q and locus 2, and between locus 1 and locus 2, respectively. r = r) + r) — 2r m, i.e., assuming that 


the recombination events in two neighboring intervals are independent. R, and Rə are the cumulated recombination frequencies during repeated 


selfing between locus 1 and locus q, and between locus q and locus 2, respectively. 


cumulated recombination frequency during repeated selfing, i.e., $R_1 = \frac{2r_1}{1+2r_1}$ and 
$R_2 = \frac{2r_2}{1+2r_2}$. 
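
The two relationships used in tables 8.18 and 8.19 are easy to evaluate. The sketch below, with hypothetical function names, computes the recombination frequency between the two flanking markers as r = r1 + r2 − 2·r1·r2 (independent recombination in the two intervals) and the cumulated recombination frequency under repeated selfing as R = 2r / (1 + 2r).

```python
def combined_recombination(r1, r2):
    """Recombination frequency between locus 1 and locus 2, assuming no interference."""
    return r1 + r2 - 2 * r1 * r2

def cumulated_recombination(r):
    """Cumulated recombination frequency in RILs developed by repeated selfing."""
    return 2 * r / (1 + 2 * r)

r1, r2 = 0.1, 0.2
print(combined_recombination(r1, r2))                           # between the flanking markers
print(cumulated_recombination(r1), cumulated_recombination(r2))
```

For unlinked loci (r = 0.5) the cumulated frequency is also 0.5, and for tightly linked loci it is close to 2r, which reflects the additional meioses experienced during selfing.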


When no distortion occurs, the eight homozygous genotypes at one complete 
genetic locus each account for one-eighth of the eight-parental DH and RIL 
populations, which can be used to impute the completely missing marker information. 
Tables 8.10 and 8.11 give the theoretical frequencies of 64 homozygous genotypes at 
two complete gene loci, which can be used to impute the incomplete marker 
information when there is only one complete marker linked with the current incomplete 
marker to be imputed. When there are two complete linked markers, the current 
incomplete marker located between the two complete markers can be imputed by the 
theoretical frequencies as given in tables 8.18 and 8.19. The imputation procedure is 
similar to what has been introduced in §8.3.2. After the imputation, all markers are 
converted to category ABCDEFGH without any missing or incomplete 
information. 


8.4.2 The Linear Regression Model of Phenotype 
on Marker Types 


Assume $A_q$, $B_q$, ..., $H_q$ are the eight alleles carried by the eight inbred parents at 
one QTL. Genotypic values of the eight QTL genotypes are given in equation 8.21 
by the one-locus additive model. 

$$\mu_k = \mu + a_k w_k \qquad (8.21)$$

where k = 1–8 represents the eight homozygous genotypes at the QTL; $\mu_k$ is the 
kth genotypic value of the QTL; $\mu$ is the overall mean of the eight QTL genotypic values, 
or the mean of the progeny population; $a_k$ is the additive effect of the kth allele; and $w_k$ is 
an indicator of the QTL genotype, valued at 1 for the kth parental allele and 0 for the 
other parental alleles. 

On the other hand, if the eight genotypic values are known, the population mean 
and additive effects can be calculated by equation 8.22. 

$$\mu = \frac{1}{8} \sum_{k=1}^{8} \mu_k, \quad a_k = \frac{1}{8} \Big( 7\mu_k - \sum_{l=1,\, l \neq k}^{8} \mu_l \Big) \quad (k = 1\text{–}8) \qquad (8.22)$$

When there is no segregation distortion, the genetic variation contributed by 
the QTL is given by equation 8.23. 

$$V_G = \frac{1}{8} \sum_{k=1}^{8} a_k^2 \qquad (8.23)$$


One restriction has to be made so as to estimate the nine genetic parameters in 
equations 8.21 and 8.22, i.e., the sum of the eight effects has to be equal to 0. To 
avoid the complexity caused by the restricted condition in parameter estimation, 
one orthogonal model is built, which is equivalent to equation 8.21, but without any 
restrictions. Taking locus q as an example, the orthogonal variables and their values 
are shown in table 8.20. Indicators u, v, and w in the table can only be equal to either 
1 or −1, resulting in eight combinations standing for the eight homozygous 
genotypes. The other four indicators are calculated from the three basic indicators, i.e., 
uv, uw, vw, and uvw. The readers can show the orthogonality of the eight variables 
by working out exercise 8.12. 


Tab. 8.20 – Values of the orthogonal indicators for eight homozygous genotypes at one 
genetic locus. 

Genotype   Overall mean    u     v     w    u×v   u×w   v×w   u×v×w
AqAq            1          1     1     1     1     1     1      1
BqBq            1          1     1    −1     1    −1    −1     −1
CqCq            1          1    −1     1    −1     1    −1     −1
DqDq            1          1    −1    −1    −1    −1     1      1
EqEq            1         −1     1     1    −1    −1     1     −1
FqFq            1         −1     1    −1    −1     1    −1      1
GqGq            1         −1    −1     1     1    −1    −1      1
HqHq            1         −1    −1    −1     1     1     1     −1
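
The orthogonality of the eight variables in table 8.20 can also be verified by direct computation. The sketch below, with hypothetical function names, rebuilds the table from the three basic indicators and checks that every pair of columns has a zero inner product over the eight genotypes.

```python
from itertools import product

def indicator_rows():
    """Rebuild the indicator table: one row per genotype A..H, columns 1, u, v, w, uv, uw, vw, uvw."""
    rows = {}
    for name, (u, v, w) in zip("ABCDEFGH", product((1, -1), repeat=3)):
        rows[name] = (1, u, v, w, u * v, u * w, v * w, u * v * w)
    return rows

def is_orthogonal(rows):
    """True if every pair of distinct columns has a zero inner product over the genotypes."""
    columns = list(zip(*rows.values()))
    return all(
        sum(a * b for a, b in zip(columns[i], columns[j])) == 0
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    )

rows = indicator_rows()
for name, row in rows.items():
    print(name, row)
print(is_orthogonal(rows))
```

Because the columns are mutually orthogonal when the eight genotypes are equally frequent, the coefficients of the orthogonal model can be estimated without the sum-to-zero restriction of the additive model.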


Assuming there are a number of m QTLs in the population, the genotypic value 
of the jth QTL is represented by equation 8.24 by using the orthogonal model. 


Gj = p+ dirty + bp) + bzw + baujvj + djs uy wy + bovjw + byrujvjwy (8.24) 


where $u_j$, $v_j$, and $w_j$ are the orthogonal indicators of the jth QTL genotype, taking
the same values as those defined in table 8.20. The relationship of parameters
defined in equation 8.21 with those defined in equation 8.24 is given in exercise 8.12.

Under the assumption of additivity on genotypic effects from different QTLs, the
total genotypic value of the pure-line progeny can be given in equation 8.25.


$$G = \mu + \sum_{j=1}^{m}\left(b_{j1}u_j + b_{j2}v_j + b_{j3}w_j + b_{j4}u_jv_j + b_{j5}u_jw_j + b_{j6}v_jw_j + b_{j7}u_jv_jw_j\right) \tag{8.25}$$


Similar to ICIM as has been introduced in chapters 5 and 7, we can start from 
equation 8.24 and build the inclusive linear model between the genotypic value of 
the jth QTL and the flanking markers, and then use equation 8.25 to build the 
inclusive linear model of the genotypic value of pure-line progenies depending on the 
whole-genome markers. Assume there are a number of m QTLs located on m 
intervals defined by m + 1 markers on one chromosome. There is at most one QTL 
in one marker interval. For the intervals without QTL, the QTL effects are set at 0. 
Similar to the orthogonal indicators on QTL genotypes as defined in table 8.20, 
orthogonal indicators are also defined for markers, and the linear regression model of 
phenotypic values on markers can be derived and given in equation 8.26 for the 
pure-line progenies. 


$$P = \mu + \sum_{j=1}^{m+1}\left(\alpha_j x_j + \beta_j y_j + \gamma_j z_j + \tau_j x_j y_j + \lambda_j x_j z_j + \delta_j y_j z_j + \phi_j x_j y_j z_j\right) + \varepsilon \tag{8.26}$$

where P is the phenotypic value of the pure-line progeny; $\varepsilon$ is random error assumed
to be normally distributed with mean at 0; $\alpha_j$, $\beta_j$, $\gamma_j$, $\tau_j$, $\lambda_j$, $\delta_j$, and $\phi_j$ are effects of the
jth marker caused by QTL (to be estimated); $x_j$, $y_j$, and $z_j$ are the orthogonal
indicators of the jth marker, and their values for the eight marker types are the same
as those for the eight QTL genotypes given in table 8.20.

Similar to the four-parental pure-line populations, QTL effects can also cause 
interactions between flanking markers. To reduce the number of parameters in the 
regression model, the interaction effects between markers are not considered in 
equation 8.26. Strictly speaking, equation 8.26 is not an inclusive linear model of the 
phenotype on the whole-genome markers. When the recombination frequencies 
between QTL and the linked markers are not equal to 0, there may be part of the QTL 
variation which cannot be fitted in the linear model in equation 8.26. The unfitted 
genetic variation will be added to the random error in estimation and testing. 


8.4.3 Inclusive Composite Interval Mapping (ICIM) 
in Eight-Parental Pure-Line Populations 


Firstly, by considering all marker information at the same time, equation 8.26 is 
used to select the most important variables. The coefficients of those variables not 
retained by stepwise regression are set at 0. Secondly, phenotypic values are adjusted 
by significant markers selected by stepwise regression, i.e., 


$$\Delta P_i = P_i - \sum_{j \neq k,\, k+1}\left(\hat{\alpha}_j x_{ij} + \hat{\beta}_j y_{ij} + \hat{\gamma}_j z_{ij} + \hat{\tau}_j x_{ij} y_{ij} + \hat{\lambda}_j x_{ij} z_{ij} + \hat{\delta}_j y_{ij} z_{ij} + \hat{\phi}_j x_{ij} y_{ij} z_{ij}\right) \tag{8.27}$$

where $P_i$ is the phenotypic value of the ith pure-line progeny (i = 1, 2, ..., n, and n is
the population size); k and k + 1 represent the two flanking markers of the current
scanning position; the hat symbol represents the estimated value of the parameter;
$x_{ij}$, $y_{ij}$, and $z_{ij}$ are the orthogonal indicators of genotypes of the ith progeny at the jth
marker. The adjusted phenotypic value $\Delta P_i$ contains information on QTL position
and effects at the current interval, and in the meantime, most genetic variations of
QTLs out of the current interval have been excluded by the adjustment.
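The phenotypic adjustment can be sketched in a few lines. The code below is an illustration only: the helper names `marker_design` and `adjust_phenotype`, the zero-based marker indexing, and the toy data are assumptions made for the example, and the coefficients are taken as given rather than estimated by stepwise regression.

```python
import numpy as np

def marker_design(x, y, z):
    """Seven orthogonal columns of one marker built from its basic indicators."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    return np.column_stack([x, y, z, x * y, x * z, y * z, x * y * z])

def adjust_phenotype(phenotype, designs, coefs, k):
    """Subtract fitted effects of every marker except the flanking pair k, k+1.

    designs[j] is the (n, 7) indicator matrix of marker j; coefs[j] holds its
    seven estimated coefficients, all zero if the marker was not retained.
    Markers are indexed from 0 here, which is an illustrative choice.
    """
    adjusted = np.array(phenotype, dtype=float)
    for j, (design, coef) in enumerate(zip(designs, coefs)):
        if j in (k, k + 1):
            continue  # keep the signal of the current scanning interval
        adjusted -= design @ np.asarray(coef, dtype=float)
    return adjusted

# Hypothetical toy data: 6 lines, 4 markers, only marker 0 retained.
rng = np.random.default_rng(1)
designs = [marker_design(*rng.choice([-1, 1], size=(3, 6))) for _ in range(4)]
coefs = [np.zeros(7) for _ in range(4)]
coefs[0] = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
phenotype = 50.0 + designs[0] @ coefs[0]

delta_p_far = adjust_phenotype(phenotype, designs, coefs, k=2)   # interval 2-3
delta_p_near = adjust_phenotype(phenotype, designs, coefs, k=0)  # interval 0-1
```

When the scanning interval is far from the retained marker, its fitted effect is removed and only the mean remains in this noise-free toy case; when the retained marker flanks the scanning interval, the phenotype is left unchanged so that the interval keeps its QTL signal.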

Interval mapping is conducted on the adjusted phenotypic values. Phenotypic 
observations of the eight homozygous QTL genotypes follow normal distributions 
with different means but the same variance, i.e., $N(\mu_k, \sigma^2)$, k = 1-8. The null and
alternative hypotheses used to test the existence of QTL are,

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_8$$
$$H_A: \text{at least two of } \mu_1, \mu_2, \ldots, \mu_8 \text{ are not equal}$$

The logarithm likelihood under $H_A$ is,

$$\ln L_A = \sum_{j=1}^{64}\sum_{i \in S_j}\ln\Bigl[\sum_{k=1}^{8}\pi_{jk}\, f(\Delta P_i; \mu_k, \sigma^2)\Bigr] \tag{8.28}$$

where $S_j$ represents the individuals belonging to the jth marker class of two flanking
markers (j = 1-64 corresponds to the 64 genotypes of two flanking markers); $\pi_{jk}$
(k = 1-8) is the proportion of the kth QTL genotype in the jth marker class; and
$f(\bullet; \mu_k, \sigma^2)$ represents the density function of normal distribution $N(\mu_k, \sigma^2)$.

The EM algorithm is used for maximum likelihood estimation in equation 8.28.
The detailed process is similar to that introduced in §8.3.4, and will not be given
here. Under $H_0$, genotypic values of the eight QTL genotypes follow the same normal
distribution $N(\mu_0, \sigma_0^2)$. Maximum likelihood estimation of mean and variance is also
similar to that introduced in §8.3.4. Finally, LRT statistic and LOD score are
calculated from the maximum likelihood functions under the two hypotheses and then
used to test the existence of QTL. Values of parameters corresponding to the
maximum likelihood functions under the alternative hypothesis are used as
maximum likelihood estimates of the parameters.
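A generic EM iteration for a normal mixture with known mixing proportions may help to fix ideas. The sketch below is not the procedure of §8.3.4 or of the GAPL software; the function names, starting values, and convergence rule are assumptions, and the toy example uses two QTL genotypes and two marker classes instead of eight and 64 to keep it short.

```python
import numpy as np

def normal_pdf(y, mean, var):
    return np.exp(-((y - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def mixture_terms(y, prior, mu, var):
    """Return pi_jk * f(y_i; mu_k, var) for each line and genotype, and ln L."""
    dens = prior * normal_pdf(y[:, None], mu[None, :], var)
    return dens, float(np.log(dens.sum(axis=1)).sum())

def em_interval(y, cls, pi, n_iter=500, tol=1e-10):
    """EM under the alternative hypothesis; pi[j, k] is treated as known."""
    y = np.asarray(y, dtype=float)
    prior = np.asarray(pi, dtype=float)[np.asarray(cls)]      # shape (n, K)
    n_geno = prior.shape[1]
    mu = y.mean() + np.linspace(-1.0, 1.0, n_geno) * y.std()  # arbitrary start
    var = y.var()
    dens, loglik = mixture_terms(y, prior, mu, var)
    for _ in range(n_iter):
        w = dens / dens.sum(axis=1, keepdims=True)            # E-step: posteriors
        wk = w.sum(axis=0)
        # M-step: weighted means and a common variance
        mu = np.where(wk > 0, (w * y[:, None]).sum(axis=0) / np.maximum(wk, 1e-300), mu)
        var = float((w * (y[:, None] - mu[None, :]) ** 2).sum() / y.size)
        dens, new_loglik = mixture_terms(y, prior, mu, var)
        converged = tol > new_loglik - loglik
        loglik = new_loglik
        if converged:
            break
    return mu, var, loglik

def lod_score(y, cls, pi):
    """LOD = log10 of the likelihood ratio between the two hypotheses."""
    y = np.asarray(y, dtype=float)
    loglik0 = float(np.log(normal_pdf(y, y.mean(), y.var())).sum())
    _, _, loglik_a = em_interval(y, cls, pi)
    return (loglik_a - loglik0) / np.log(10.0)

# Toy example: the QTL sits on the marker, so each class holds one genotype.
y = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])
cls = np.array([0, 0, 0, 1, 1, 1])
pi = np.eye(2)
mu_hat, var_hat, _ = em_interval(y, cls, pi)
lod = lod_score(y, cls, pi)
```

In this degenerate case the posterior probabilities equal the prior proportions, so the estimates reduce to the class means and the pooled within-class variance. With proportions between 0 and 1, the same loop reweights each line by its posterior probability at every iteration.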

In the ICIM algorithm as introduced above, if phenotype P is used in
equation 8.28 instead of the adjusted phenotype $\Delta P$ in the one-dimensional scanning, the
method becomes simple interval mapping (IM). In addition to ICIM and IM, the
GAPL software also implements the single marker analysis (SMA) for the
eight-parental DH and RIL populations. Figure 8.5 shows the LOD histogram from
SMA, and LOD score profiles from IM and ICIM in one simulated eight-parental
population consisting of 500 pure lines. It can be seen that for the three methods,
there are markers or positions with significantly high LOD scores on the first six
chromosomes. If the LOD score threshold is set at 6.0, there are QTLs affecting the
phenotypic trait on these chromosomes. LOD scores from IM (figure 8.5B) and SMA
(figure 8.5A) are similar. If the data points in figure 8.5A are connected to make a
line, the line is similar to the LOD profile from IM. If peaks on the LOD profile
higher than the LOD threshold are viewed as QTLs, mapping results from the two
methods are actually the same. Neither method has background control in QTL
detection. If the linkage map is dense enough, the two methods are close to
being equivalent. These results have been observed previously from bi-parental and
four-parental mapping populations.

FIG. 8.5 — Bar graph of LOD score from single marker analysis (SMA) (A), and profiles of
LOD score from simple interval mapping (IM) and inclusive composite interval mapping
(ICIM) (B) in a simulated eight-parental RIL population.

TAB. 8.21 — Mapping results from ICIM in a simulated eight-parental RIL population.

                      Genetic effect of the eight parental genotypes
LOD score   PVE (%)   A (a_1)   B (a_2)   C (a_3)   D (a_4)   E (a_5)   F (a_6)   G (a_7)
17.64       11.03     -6.66     12.71     -11.23    19.82*    -7.35     -12.34    -2.27
8.16        4.14      0.40      -5.65     -0.76     0.72      16.56*    -3.15     -6.37
15.25       9.34      -2.13     14.81*    -15.08    0.37      -9.39     -10.34    13.59
22.81       14.10     -15.05    -22.41    0.72      12.35     14.84*    12.40     -5.62
13.12       8.60      12.27*    -2.94     8.14      -16.80    1.14      -11.36    -2.13
9.37        5.83      7.45      8.58      -8.30     -3.69     0.30      13.33*    -9.85

Note: the highest additive effect at each detected QTL is marked with an asterisk.

There are many more parameters to be estimated in the eight-parental pure-line
populations. Under the same conditions of map length and marker density, LOD score
thresholds for the eight-parental pure-line populations are much larger than those
for the four-parental and bi-parental pure-line populations. When the genome-wide
type I error is set at 0.05, the LOD score threshold for the simulated eight-parental
RIL population used in figure 8.5 is equal to 6.07 from a test with 1000 permutations.
Under the same map length and marker density, the LOD score threshold is around
4.00 for the four-parental pure-line populations and around 3.00 for the bi-parental
pure-line populations. For the simulated population in figure 8.5, the three
pre-defined QTLs on chromosomes 1, 3, and 5 are all located at 25 cM; the three
pre-defined QTLs on chromosomes 2, 4, and 6 are all located at 55 cM; there are no
QTLs defined on chromosomes 7 and 8. It can be seen from the mapping results shown
in table 8.21 that the six detected QTLs are all close to the six pre-defined QTLs.
There is a positive correlation between LOD scores and PVEs of the detected QTLs.
The eight additive effects at each QTL differ considerably. The alleles with the
largest effects come from parents D, E, B, E, A, and F at the six detected QTLs,
respectively. Therefore, the pure-line progeny combining these parental alleles will
have the largest phenotypic value.


Exercises 


8.1 In table 8.6, $n_1$-$n_{12}$ are the observed sample sizes of 12 identifiable genotypes,
and n is the total population size. Show the following equation for the maximum
likelihood estimate of recombination frequency, based on the theoretical frequencies
of identifiable genotypes in the DH progenies.

$$r = \frac{n_{2:3} + n_{5:7} + n_{9:11}}{n + n_{8:9} + n_{11:12}}$$


8.2 In a four-parental DH population, assume two markers belong to
category ABCC and AAAD, respectively. The following table gives the relationship
between the six identifiable genotypes (numbered by 1-6) and the 16 complete
genotypes (numbered by 1-16).

Allele   Sixteen genotypes at two    Six identifiable genotypes at
         complete markers            two markers belonging to
                                     categories ABCC and AAAD
         A    B    C    D            A    B    C    D
A        1    2    3    4            1    1    1    2
B        5    6    7    8            3    3    3    4
C        9    10   11   12           5    5    5    6
D        13   14   15   16           5    5    5    6


(1) Use the theoretical frequencies of complete genotypes as given in table 8.4 to 
calculate the theoretical frequencies of six identifiable genotypes at the two 
incomplete loci. 

(2) If the EM algorithm is used to calculate the recombination frequency between
the two incomplete loci, how can the observed sample sizes of incomplete 
genotypes 1, 3, 5, and 6 be split to the complete genotypes? 


8.3 In a four-parental pure-line population consisting of 90 DH lines, genotypic data 
at two complete markers are shown in the following table, where A, B, C, and D 
represent the four homozygous genotypes AA, BB, CC, and DD in the population, 
respectively. Calculate the observed sample sizes of the 16 complete genotypes in the 
population, and use equation 8.4 to calculate the recombination frequency between 
the two marker loci. 


Range of   Marker   DH lines
DH lines            1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
1-15       M1       A   B   B   C   D   B   D   C   C   B   B   C   B   C   B
           M2       A   C   B   C   D   C   D   C   C   B   B   C   B   A   B
16-30      M1       B   A   A   C   D   A   C   A   C   D   B   C   B   C   C
           M2       B   A   A   C   D   A   C   A   C   D   B   C   A   C   C
31-45      M1       C   D   D   C   D   B   A   B   B   D   D   D   A   D   B
           M2       C   D   D   C   D   B   D   B   B   D   D   D   A   D   B
46-60      M1       A   C   D   A   B   C   B   B   D   A       C   D   D   B
           M2       D   C   D   A   D   C   A   B   D   B   A   C   D   D   B
61-75      M1       A   B   C   D   D   C   B   B   A   C   B   C   C   C   D
           M2       A   B   C   D   D   C   B   B   A   C   B   C   C   C   C
76-90      M1       D   B   C   C   D   A   C   B   C   D   C   A   B   C   D
           M2       D   B   C   D   D   A   B   B   C   D   C   A   B   C   B


8.4 In exercise 8.3, assume the two markers belong to categories ABCC and AAAD, 
i.e., C and D at M1 cannot be separated, and A, B, and C at M2 cannot be separated.
Calculate the observed sample sizes of six identifiable genotypes in the population, 
and use the EM algorithm to calculate recombination frequency between the two 
marker loci. 


8.5 In a four-parental DH population, assume that two linked markers both belong 
to category ABAB. The following table gives the relationship between the four 
identifiable genotypes at the two incomplete marker loci having 16 complete 
genotypes. Use the theoretical frequencies of complete genotypes in table 8.4 to 
calculate the theoretical frequencies of the four identifiable genotypes at the two 
incomplete loci.

Allele   Sixteen genotypes at two    Four identifiable genotypes at
         complete markers            two markers both belonging to
                                     category ABAB
         A    B    C    D            A    B    C    D
A        1    2    3    4            1    2    1    2
B        5    6    7    8            3    4    3    4
C        9    10   11   12           1    2    1    2
D        13   14   15   16           3    4    3    4


8.6 Genotypes of two pure-line parents at two linked loci are represented by
$A_1A_1A_2A_2$ and $B_1B_1B_2B_2$, respectively. The recombination frequency between the
two linked loci is r. Assume that the haploid doubling is conducted in their F2 hybrid
to generate one bi-parental DH population. The following table gives the transition
matrix from the single cross F2 to the DH progenies. Show that in the bi-parental
DH population, theoretical frequencies of homozygous genotypes $A_1A_1A_2A_2$ and
$B_1B_1B_2B_2$ are both equal to $\frac{1}{4}(2 - 3r + 2r^2)$, and theoretical frequencies of
homozygous genotypes $A_1A_1B_2B_2$ and $B_1B_1A_2A_2$ are both equal to $\frac{1}{4}(3r - 2r^2)$.
Frequencies of the four homozygous genotypes are exactly the same as those of the
four identifiable genotypes given in table 8.5.

Genotype in the single   Frequency      Genotype in the DH progenies
cross F2 population                     A1A1A2A2    A1A1B2B2    B1B1A2A2    B1B1B2B2
A1A2/A1A2                (1-r)^2/4      1           0           0           0
A1A2/A1B2                r(1-r)/2       1/2         1/2         0           0
A1A2/B1A2                r(1-r)/2       1/2         0           1/2         0
A1A2/B1B2                (1-r)^2/2      (1-r)/2     r/2         r/2         (1-r)/2
A1B2/A1B2                r^2/4          0           1           0           0
A1B2/B1A2                r^2/2          r/2         (1-r)/2     (1-r)/2     r/2
A1B2/B1B2                r(1-r)/2       0           1/2         0           1/2
B1A2/B1A2                r^2/4          0           0           1           0
B1A2/B1B2                r(1-r)/2       0           0           1/2         1/2
B1B2/B1B2                (1-r)^2/4      0           0           0           1

8.7 In an eight-parental DH population, there are 16 identifiable genotypes at two
marker loci both belonging to category ABCDABCD. 


(1) Similar to table 8.13, give the relationship between the 16 identifiable genotypes 
and the 64 complete genotypes. 

(2) Use the theoretical frequencies of complete genotypes given in table 8.10 to 
calculate the theoretical frequencies of the 16 identifiable genotypes. 

(3) If the two loci in table 8.12 both belong to category ABCDABCD, give the 
observed sample sizes of the 16 identifiable genotypes. 


8.8 In an eight-parental DH population, there are 16 identifiable genotypes at two 
marker loci belonging to categories AABBCCDD and ABCDABCD. 


(1) Similar to table 8.13, give the relationship between the 16 identifiable genotypes 
and the 64 complete genotypes. 

(2) If the EM algorithm is used to calculate the recombination frequency between
the two marker loci, how can the observed sample size of the first identifiable
genotype be split to the complete genotypes?

(3) If the two loci in table 8.12 belong to categories AABBCCDD and
ABCDABCD, respectively, give the observed sample sizes of the 16 identifiable
genotypes.


8.9 Assuming there is one QTL in a four-parental DH population, phenotypic means 
of the four homozygous genotypes at the QTL are equal to 80, 66, 72, and 82, 
respectively. The recombination frequency between one complete marker and the 
QTL is equal to 0.1. 


(1) Calculate the additive effects of the four alleles, population mean, and genetic 
variance. 

(2) Taking locus 1 in table 8.4 as the marker and locus 2 as the QTL, calculate the 
theoretical frequencies of four homozygous QTL genotypes under each of the 
four marker genotypes. 

(3) Calculate mean and variance for each of the four marker genotypes, and com- 
pare them with the population mean and genetic variance calculated in (1). 


8.10 Assuming there is one QTL in a four-parental RIL population, phenotypic 
means of the four homozygous genotypes at the QTL are equal to 80, 66, 72, and 82, 
respectively. The recombination frequency between a complete marker and the QTL 
is equal to 0.1. 


(1) Calculate the additive effects of the four alleles, population mean, and genetic 
variance. 

(2) Taking locus 1 in table 8.5 as the marker and locus 2 as the QTL, calculate the 
theoretical frequencies of the four homozygous QTL genotypes under each of 
the four marker genotypes. 

(3) Calculate the mean and variance for each of the four marker genotypes, and 
compare them with the population mean and genetic variance calculated in (1).

8.11 In a four-parental pure-line population, assume the four homozygous genotypic
values of one QTL are $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$, respectively. The orthogonal model given in
equation 8.15 is further denoted as,

$$\mu_1 = \mu + b_1 + b_2 + b_3, \quad \mu_2 = \mu + b_1 - b_2 - b_3,$$
$$\mu_3 = \mu - b_1 + b_2 - b_3, \quad \mu_4 = \mu - b_1 - b_2 + b_3$$


(1) Work out the solutions of the linear equations. 

(2) Assuming the estimates of four homozygous genotypic values at a scanning 
position are equal to 20.5, 32.4, 22.6, and 16.9, respectively, work out the 
additive effects of the four alleles, and values of the four variables in the 
orthogonal model. Show the relationship given in equation 8.16. 


8.12 Show that the design matrix given in table 8.20 is orthogonal, i.e., $X^TX$ is a
diagonal matrix. Show the following relationship between parameters $b_k$ (k = 1-7)
in the orthogonal model defined in equation 8.24 and the eight additive effects
defined in equation 8.21.

$$b_1 = \frac{a_1 + a_2 + a_3 + a_4}{4}, \quad b_2 = \frac{a_1 + a_2 + a_5 + a_6}{4}, \quad b_3 = \frac{a_1 + a_3 + a_5 + a_7}{4},$$
$$b_4 = \frac{a_1 + a_2 + a_7 + a_8}{4}, \quad b_5 = \frac{a_1 + a_3 + a_6 + a_8}{4}, \quad b_6 = \frac{a_1 + a_4 + a_5 + a_8}{4},$$
$$b_7 = \frac{a_1 + a_4 + a_6 + a_7}{4}$$


Chapter 9 


QTL Mapping in Other Genetic 
Populations 


Mendel’s hybridization experiments on garden pea (Pisum sativum) established the 
classic genetics methodology using the segregating populations of artificially con- 
trolled crosses to study the inheritance of phenotypic variations on biological traits. 
Genetic populations in plants are generally derived from two or a few parents. 
Chapters 2-8 of this book describe in detail the genetic composition and analysis 
methods in progeny populations derived from two, four, and eight parents. In humans 
and animals, genetic populations, in general, consist of individuals from one or more 
nuclear families. These populations are generated in such a way that selection at the 
gamete level as well as at the zygote level is avoided as much as possible, or only
low-intensity natural selection is present. Populations thus obtained have specific
expected allelic frequencies and genotypic frequencies at each locus, following or
approximately following the known Mendelian segregation ratios. In such populations,
the association between two genetic loci reflects their linkage relationship on
the chromosome. The recombination frequency or crossing-over rate can therefore be
estimated from the association between marker loci, and thus the genetic linkage
maps can be constructed (see chapters 2, 3, §7.1-§7.3 in chapter 7, and §8.1-§8.2 in
chapter 8). Subsequently, genetic studies can be carried out to locate genes that
control the phenotypic traits through the associations between marker loci and traits
(see chapters 4-6, §7.4 in chapter 7, and §8.3-§8.4 in chapter 8). Genetic mapping
methods developed for these populations are sometimes referred to as linkage
analysis.

Selection occurs to varying degrees in most breeding populations. This chapter
describes the mapping methods in some kinds of selected populations. §9.1
introduces the analysis methods in selected bi-parental populations, where the selection is
conducted on phenotypic traits. §9.2 introduces QTL mapping in populations
consisting of chromosomal segment substitution lines (CSSLs) which are produced
by repeated backcrossing and selecting, where the selection is primarily based on
genotypes by marker screening. §9.3 describes the mapping method in nested
association mapping (NAM) populations which are produced by crossing multiple
parents with one common parent. Using the grain width gene gw-5 in rice as an
example, §9.4 outlines the general procedure for fine mapping, map-based cloning,
and functional analysis of quantitative trait genes. The last section introduces
briefly the association mapping methodology applicable to natural populations.


9.1 Selective Genotyping Analysis and Bulked Segregant 
Analysis 


The analysis methods described in chapters 4-8 require both genotypic and phe- 
notypic information on all individuals or lines included in the entire mapping pop- 
ulation. Genotyping can be expensive and time-consuming if the population is large. 
To reduce the genotyping expenses, selective genotyping analysis (SGA) was
proposed at the very beginning of QTL mapping studies (Sun et al., 2010; Navabi et al.,
2009; Wingbermuehle et al., 2004; Darvasi and Soller, 1992, 1994; Lander and
Botstein, 1989; Lebowitz et al., 1987). SGA conducts the genotypic screening only
for the individuals with the highest or lowest phenotypic values. There are two
types of selective genotyping: two-tailed analysis, which uses both the top and bottom
tails, and one-tailed analysis, which uses only one tail. It has been shown that the two-tailed
analysis is more efficient if the ratio of costs in genotyping and phenotyping is
greater than one; the one-tailed analysis is more effective if the ratio is greater than
two (Gallais et al., 2007).


9.1.1 Statistical Principles of Selective Genotyping 
Analysis 


Assume that one QTL is linked with one marker locus M. Selection causes the 
change in frequency of the genes controlling the phenotypic trait, which also causes 
the frequencies of linked markers to deviate from the expectation in the absence of 
selection. For marker loci that are not associated with any gene on the trait, their 
frequencies will remain unchanged before and after selection. SGA tests for the 
presence of QTL by the change in allelic frequencies at each marker locus. As shown 
in figure 9.1, marker locus M is assumed to be polymorphic between the two parents, 
denoted by M and m, respectively. Polymorphism M is linked to the increasing allele 
and polymorphism m is linked to the decreasing allele. Then in the bottom-tailed 
population, the frequency of M will decrease, and the frequency of m will increase; in 
the top-tailed population, the frequency of M will increase, and the frequency of 
m will decrease. In one-tailed analysis, the change in frequency can be measured by 
the difference in frequency between the selected and un-selected populations. In 
two-tailed analysis, the change in frequency can be measured by the difference 
between the two tails. One marker is considered to be linked to one QTL if the 
marker frequency is changed significantly before and after selection; otherwise, the 
marker locus is considered not to be linked to any QTLs on the selected trait. 


[Figure 9.1. Labels: frequency of marker M; bottom-tailed selection, no selection, top-tailed selection; M is linked to the increasing allele; no linkage.]

FIG. 9.1 — Schematic representation of the change in marker frequency in two tails as well as
in the unselected population.


Denote the frequencies of two alleles at one locus by $p_0$ and $q_0$ in the unselected
population, respectively. One RIL population is shown in figure 9.1, where $p_0 =
q_0 = 0.5$. The expected allele frequencies are different in different populations.
In RIL, DH, and F2 populations derived from the F1 hybrid, both alleles at each
locus have the expected frequency at 0.5. In populations derived from BC1F1, two
alleles at each locus have frequencies of 0.75 and 0.25 (see table 1.5 in chapter 1). In
the two-tailed populations, $p_b$ and $q_b$ represent the frequencies of two alleles at one
locus in the bottom tail, $p_t$ and $q_t$ represent the frequencies of two alleles in the top
tail, and $n_b$ and $n_t$ represent the total numbers of alleles in the two tails, respectively
(see §1.2, chapter 1 for the calculation of allelic number and frequency). Variances of
allele frequencies in the two tail populations can be obtained from the binomial
distribution, i.e., equation 9.1.


$$V(q_b) = \frac{p_b q_b}{n_b}, \qquad V(q_t) = \frac{p_t q_t}{n_t} \tag{9.1}$$


$\Delta q = q_t - q_b$ denotes the difference in allele frequencies in the two-tailed
populations. The variance of this difference is given in equation 9.2.

$$V(\Delta q) = \frac{p_b q_b}{n_b} + \frac{p_t q_t}{n_t} \tag{9.2}$$


If only a one-tailed population is used, the difference in allele frequencies is
denoted by $\Delta q = q_t - q_0$, using the top-tailed population as an example. The
variance of this difference is equal to the variance of gene frequency $q_t$ in equation 9.1.
The t-test statistic given in equation 9.3 can be calculated from the variance in
equation 9.1 or equation 9.2, and then used to test whether $\Delta q$ differs significantly
from zero.

$$t = \frac{|\Delta q|}{\sqrt{V(\Delta q)}} \tag{9.3}$$
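The test statistic for both designs can be sketched as below. The function names and the allele frequencies in the usage lines are hypothetical; the variances follow the binomial form described above.

```python
import math

def t_two_tailed(q_b, n_b, q_t, n_t):
    """t statistic for the frequency difference between the two tails."""
    delta_q = q_t - q_b
    variance = q_b * (1.0 - q_b) / n_b + q_t * (1.0 - q_t) / n_t
    return abs(delta_q) / math.sqrt(variance)

def t_one_tailed(q_t, n_t, q_0):
    """t statistic for one tail against the expected unselected frequency."""
    delta_q = q_t - q_0
    variance = q_t * (1.0 - q_t) / n_t  # q_0 is an expectation, so only q_t varies
    return abs(delta_q) / math.sqrt(variance)

# Hypothetical tails with 100 alleles each (50 pure lines, two alleles per line).
t2 = t_two_tailed(q_b=0.3, n_b=100, q_t=0.7, n_t=100)
t1 = t_one_tailed(q_t=0.7, n_t=100, q_0=0.5)
```

With the same top tail, the two-tailed statistic is larger in this example because the frequency difference doubles while the variance also doubles, so the ratio grows by a factor of about 1.4.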


9.1.2 Likelihood Ratio Test and LOD Score Statistics 
from Selective Genotyping Analysis 


Significant markers detected by the t statistic in equation 9.3 can be considered to
be associated with QTLs controlling the phenotypic trait. However, in QTL
mapping, statistics such as the likelihood ratio test or LOD score are more commonly
adopted. Two-tailed selection is used as an example to illustrate the calculation of
the LOD score for selective genotyping analysis to detect QTLs. The null and
alternative hypotheses of the test are,

$$H_0: \Delta q = 0 \ (\text{or } q_t = q_b), \quad \text{and} \quad H_A: q_t \neq q_b \tag{9.4}$$


Let $n_{1b}$ and $n_{1t}$ denote the numbers of one allele in the bottom- and top-tailed
populations, and $n_{2b}$ and $n_{2t}$ denote the numbers of the alternative allele in two tails.
Allele frequencies in the two-tailed populations are calculated by equation 9.5. The
likelihood function under the alternative hypothesis $H_A$ is given in equation 9.6.

$$p_b = \frac{n_{1b}}{n_b}, \quad q_b = \frac{n_{2b}}{n_b}, \quad p_t = \frac{n_{1t}}{n_t}, \quad q_t = \frac{n_{2t}}{n_t} \tag{9.5}$$

$$L_A = C_{n_b}^{n_{1b}} (p_b)^{n_{1b}} (q_b)^{n_{2b}} \times C_{n_t}^{n_{1t}} (p_t)^{n_{1t}} (q_t)^{n_{2t}} \tag{9.6}$$


Under the null hypothesis, gene frequencies are calculated by combining the
two-tailed populations, i.e., equation 9.7. Substitute the combined frequencies from
equation 9.7 into equation 9.6 to obtain the likelihood function under the null
hypothesis, i.e., equation 9.8. From equations 9.6 and 9.8, the likelihood ratio test
(LRT) statistic and LOD score can be calculated. The difference in the
number of independent parameters from the two hypotheses of equation 9.4 is equal
to one, and the LRT statistic approximately follows a $\chi^2$ distribution with one degree
of freedom.

$$p_b = p_t = p_0 = \frac{n_{1b} + n_{1t}}{n_b + n_t}, \qquad q_b = q_t = q_0 = \frac{n_{2b} + n_{2t}}{n_b + n_t} \tag{9.7}$$

$$L_0 = C_{n_b}^{n_{1b}} (p_0)^{n_{1b}} (q_0)^{n_{2b}} \times C_{n_t}^{n_{1t}} (p_0)^{n_{1t}} (q_0)^{n_{2t}} \tag{9.8}$$
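The LRT statistic and LOD score for one marker can be computed directly from the four allele counts. The sketch below uses hypothetical function names and counts; the binomial coefficients are omitted because they are identical under both hypotheses and cancel in the likelihood ratio.

```python
import math

def count_loglik(n1, n2, p):
    """Log-likelihood of n1 and n2 allele counts, without the binomial coefficient."""
    loglik = 0.0
    if n1:
        loglik += n1 * math.log(p)
    if n2:
        loglik += n2 * math.log(1.0 - p)
    return loglik

def sga_lrt_lod(n1b, n2b, n1t, n2t):
    """Return (LRT, LOD) for two-tailed selective genotyping at one marker."""
    n_b, n_t = n1b + n2b, n1t + n2t
    p_b, p_t = n1b / n_b, n1t / n_t           # separate frequencies under H_A
    p_0 = (n1b + n1t) / (n_b + n_t)           # pooled frequency under H_0
    loglik_a = count_loglik(n1b, n2b, p_b) + count_loglik(n1t, n2t, p_t)
    loglik_0 = count_loglik(n1b, n2b, p_0) + count_loglik(n1t, n2t, p_0)
    lrt = 2.0 * (loglik_a - loglik_0)
    lod = (loglik_a - loglik_0) / math.log(10.0)
    return lrt, lod

# Hypothetical counts: marker allele enriched in the bottom tail, depleted in the top.
lrt_linked, lod_linked = sga_lrt_lod(70, 30, 30, 70)
# Hypothetical counts with identical frequencies in both tails.
lrt_null, lod_null = sga_lrt_lod(30, 70, 30, 70)
```

The two statistics carry the same information, since the LRT equals the LOD score multiplied by 2 ln 10, which is about 4.61.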


Figure 9.2 gives the bar graph of the LOD score from single marker analysis and
two-tailed selective genotyping analysis for kernel weight in the barley DH
population. The LOD scores from selective genotyping analysis closely resemble those
from single marker analysis at each marker, indicating that the two methods would
give similar mapping results for the same mapping population.

FIG. 9.2 — Bar graph of LOD score from single marker analysis (A) and two-tailed selective
genotyping analysis (B) for kernel weight in the barley DH population.


9.1.3 Bulked Segregant Analysis 


Bulked segregant analysis (BSA) forms two DNA pools by selecting the individuals
with extreme phenotypic performances (such as highly resistant and highly susceptible
to disease) in F2 or BC segregating populations. The number of extreme individuals
is generally equal to 5-10, and should not be too large. Polymorphic markers
are screened between the two DNA pools, and markers showing polymorphisms 
between the two pools are assumed to be linked with the target gene. This approach 
was first proposed and applied in the detection of disease-resistance genes, and the 
two DNA pools formed by extreme individuals are sometimes called the resistant pool 
and susceptible pool (Barua et al., 1993; Michelmore et al., 1991). If one marker type 
at one locus occurs only in the resistant pool, and the other type occurs only in the 
susceptible pool, the marker and the phenotype are called co-segregating, indicating 
a tight linkage between the marker and the gene controlling the trait. 

Co-segregation between molecular markers and the resistant phenotypes indicates
that the resistant allele frequency is equal to 1 and 0 in the resistant and susceptible
pools, respectively. BSA can be considered to be a special, extreme case of selective
genotyping analysis. That is to say, in
the two-tailed selection, it is ensured that the one tail forming the resistant pool
contains only the disease-resistant allele, and the other tail forming the susceptible
pool contains only the disease-susceptible allele. Thus, when compared with selective
genotyping analysis, BSA requires more accurate phenotypic evaluation, as well
as higher selection intensity. For the dominant resistant genes, both the dominant
homozygous and dominant heterozygous genotypes have the same phenotype of
resistance, and the selfed F3 lines or backcrossing with the susceptible parent are
required to identify whether the resistant F2 individuals are homozygous or
heterozygous. This ensures that only the F2 individuals that are resistant in
phenotype and homozygous in genotype are selected to form the resistant DNA pool.
Thus, the resistant allele is only present in the resistant DNA pool, and the
susceptible allele is only present in the susceptible DNA pool.

To further estimate the QTL effect and recombination frequency between QTL
and the linked markers, the whole population can be genotyped only for the
chromosomes showing polymorphic markers between the two pools. Analysis methods
for the un-selected populations, as have been introduced in chapters 4-6, are then
used to construct linkage maps for those chromosomes, map QTLs on these
chromosomes, and estimate the genetic effects of the detected QTLs. Genotyping for the
whole population analysis includes only molecular markers associated with the
traits, and thus greatly reduces the expense and time in genotyping.


9.1.4 Problems with Selective Genotyping Analysis 
and Bulked Segregant Analysis 


Selective genotyping analysis identifies molecular markers that are linked to QTLs 
from the changes in gene frequency. In general, it is difficult to estimate the 
recombination frequency or genetic distance between the marker and QTL and 
estimate the genetic effects of QTLs. Both selective genotyping analysis and bulked 
segregant analysis involve the selection of individuals based on phenotype, and the 
selected population can only be used for genetic mapping studies on the selected 
trait. If similar genetic studies are to be carried out for other traits, individuals will 
have to be re-selected on phenotypes from those traits. Therefore, the two methods 
are only suitable for genetic mapping for one single trait, based on which the 
selection is made. When performing the selection for one single trait, especially for 
bulked segregant analysis, accurate phenotypic evaluation is required. Therefore, 
the two methods are only suitable for traits controlled by one or a few major genes 
with high heritability. For traits with more complex genetic architectures, no 
obvious major-effect genes, and large random errors in phenotyping, satisfactory 
results may not be obtained by using the two methods. 


9.2 QTL Mapping in Populations of Chromosomal 
Segment Substitution Lines 


9.2.1 Characteristics of Chromosomal Segment 
Substitution Lines 


QTL mapping is commonly conducted in backcross, F2, doubled haploid, recombinant
inbred line, or other bi-parental populations. Due to the large number of segregating
loci and segregating chromosomal regions, it is sometimes difficult to
completely exclude the interaction between QTLs, accurately estimate the positions 
and effects of QTLs, and study the interactions between different QTLs. In contrast, 
the substitution lines are different from the background parent only in a few chro- 
mosomal regions, which is helpful in fine mapping and gene cloning. In fact, most 
quantitative trait genes that have been cloned so far depend on the use of substitution 
lines to a certain extent (Wan et al., 2005, 2006, 2008; Kubo et al., 2002). In addition,
the combination of single- and double-segment substitution lines provides the ideal 
genetic materials to study the interactions between genes. Heterozygous chromoso- 
mal segment substitution lines can be generated by crossing between the fixed 
substitution lines and the background parent, allowing the study of the dominant 
effects of genes, as well. Although the process to generate these materials is 
time-consuming and expensive, once generated, they are the ideal genetic materials 
for fine mapping of genes, as well as for studying gene interactions; at the same time, 
they can also be used to confirm the QTLs detected in other mapping populations. 
Therefore, the development and utilization of chromosomal segment substitution 
lines (CSSLs) have received increasing attention in genetic studies (Zhao et al., 2009; 
Wan et al., 2005, 2006, 2008; Xu et al., 2007; Wang et al., 2006; Nadeau et al., 2000; 
Darvasi and Soller, 1995). 

One chromosome is used as an example to illustrate some genetic characteristics
of the chromosomal segment substitution lines. Suppose this chromosome is
divided into five segments, denoted by $S_1$ to $S_5$. There are five single-segment
substitution lines, denoted by 1CSSL1 to 1CSSL5. A total of 10 double-segment
substitution lines can be generated as well, denoted by 2CSSL1 to 2CSSL10. For
convenience, the background parent is denoted by 0CSSL (figure 9.3). Between any one
of the single-segment substitution lines and 0CSSL, a genetic difference occurs only in
one small chromosomal segment, with the remaining segments coming from the
background parent. The two lines can be considered to be a pair of near-isogenic lines.
For a pair of near-isogenic lines, phenotypic differences are caused by genes in the
chromosomal region where the difference occurs, and thus the genes controlling the
phenotypic difference can be mapped to that chromosomal segment.
Therefore, testing for the existence of genes controlling the phenotypic trait on
the five chromosomal segments is reduced to testing for significant differences
between the five single-segment substitution lines and the background parent
0CSSL. For example, if genes controlling the phenotypic trait are present on both
segments $S_2$ and $S_5$, the test for a significant difference between the single-segment
substitution line 1CSSL2 and 0CSSL can detect the genes on segment $S_2$. Moreover,
the test for segment $S_2$ is independent of the significance test between
1CSSL5 and 0CSSL. Thus, in such a population of CSSLs, the effects of linked QTLs on
genetic analysis can be effectively excluded. Of course, if multiple genes are
present on one segment, the only way to separate the linked genes is to break this segment
into two or more shorter segments by further crossing and recombination.

FIG. 9.3 - Five single-segment substitution lines and 10 double-segment substitution lines
from one donor chromosome. Notes: (A) Fixed segment substitution lines; (B) F1 hybrids
from crosses between the fixed segment substitution lines and background parent.
Legend: donor parent segment; background parent segment.

The combined use of double-segment lines and single-segment lines can effectively
identify the presence of interaction between two chromosomal segments. Still
using segments $S_2$ and $S_5$ as an example, 0CSSL, 1CSSL2, 1CSSL5, and 2CSSL7
provide the four possible homozygous genotypes, namely, $s_2s_2s_5s_5$, $S_2S_2s_5s_5$,
$s_2s_2S_5S_5$, and $S_2S_2S_5S_5$. These genotypes have the same background chromosomal
segments everywhere except on segments $S_2$ and $S_5$. The additive genetic effects of
segments $S_2$ and $S_5$ are denoted by $a_2$ and $a_5$, and the interaction effect by $a_{25}$.
The average performance of the four genotypes in relation to the genetic effects is
represented by the matrix in equation 9.9.


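$$
\begin{bmatrix} s_2s_2s_5s_5 \\ S_2S_2s_5s_5 \\ s_2s_2S_5S_5 \\ S_2S_2S_5S_5 \end{bmatrix}
=
\begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} m \\ a_2 \\ a_5 \\ a_{25} \end{bmatrix}
\tag{9.9}
$$

This yields the estimates of the genetic effects in equation 9.10.

$$
\begin{bmatrix} m \\ a_2 \\ a_5 \\ a_{25} \end{bmatrix}
= \frac{1}{4}
\begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}
\begin{bmatrix} s_2s_2s_5s_5 \\ S_2S_2s_5s_5 \\ s_2s_2S_5S_5 \\ S_2S_2S_5S_5 \end{bmatrix}
\tag{9.10}
$$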


Therefore, the presence of interaction between segments $S_2$ and $S_5$ can be
determined by whether $s_2s_2s_5s_5 - S_2S_2s_5s_5 - s_2s_2S_5S_5 + S_2S_2S_5S_5$ is equal to zero,
or equivalently whether $s_2s_2s_5s_5 + S_2S_2S_5S_5$ is equal to $S_2S_2s_5s_5 + s_2s_2S_5S_5$. For
replicated observations, the significance of the genetic effect at each segment and of the
interaction effect between the two segments can be tested by the conventional analysis of
variance introduced in chapter 1.
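
As an illustration of how the four genotype means translate into genetic effects, the following minimal sketch solves the linear system of equation 9.9 directly. The function name and the four phenotypic means are hypothetical values chosen only for illustration; they are not taken from any data set in this book.

```python
# Sketch: solve equation 9.9 for m, a2, a5 and a25 from four genotype means.
# The phenotypic means used below are hypothetical illustration values.

def estimate_effects(mean_bg, mean_s2, mean_s5, mean_s2s5):
    """Return (m, a2, a5, a25) from the mean performance of
    0CSSL, 1CSSL2, 1CSSL5 and 2CSSL7, in that order."""
    m = (mean_bg + mean_s2 + mean_s5 + mean_s2s5) / 4.0
    a2 = (-mean_bg + mean_s2 - mean_s5 + mean_s2s5) / 4.0
    a5 = (-mean_bg - mean_s2 + mean_s5 + mean_s2s5) / 4.0
    a25 = (mean_bg - mean_s2 - mean_s5 + mean_s2s5) / 4.0
    return m, a2, a5, a25

m, a2, a5, a25 = estimate_effects(5.0, 5.4, 5.2, 5.9)
print(m, a2, a5, a25)

# Back-substitution: the background genotype mean equals m - a2 - a5 + a25.
print(m - a2 - a5 + a25)
```

In this hypothetical case the double-segment line exceeds the sum of the two single-segment deviations, so the estimated interaction effect is positive. Whether such an effect is significant still has to be judged by ANOVA on replicated observations.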

Figure 9.3B gives the heterozygous substitution lines resulting from crosses
between the fixed substitution lines and the background parent. The genetic analysis
is similar to that of the homozygous substitution lines, except that the effects
obtained have different genetic interpretations. For example, the difference
between the single-segment line 1CSSH2 and 0CSSL represents the dominant effect of
the heterozygous $S_2$ segment, the difference between the single-segment line 1CSSH5
and 0CSSL represents the dominant effect of the heterozygous $S_5$ segment, and the
interaction effect estimated from 0CSSL, 1CSSH2, 1CSSH5, and 2CSSH7 represents
the dominance by dominance interaction between the two segments.


9.2.2 Mapping Methods in Populations of Chromosomal 
Segment Substitution Lines 


It is a difficult task and a lengthy procedure to generate the ideal population of 
single-segment substitution lines as shown in figure 9.3, and to ensure that the 
donor segments in these substitution lines can cover the entire genome of the 
donor parent. Genotypic data for 65 rice substitution lines on 82 chromosomal 
segments are given in figure 9.4. The background parent is the japonica rice 
variety Asominori, and the donor parent is the indica rice variety IR24. The 
linkage map was constructed by a population of recombinant inbred lines derived 
from Asominori and IR24. The substitution lines were generated by repeatedly 
selecting the donor segments in backcrosses of recombinant inbred lines with the 
background parent Asominori, supplemented by the foreground marker-assisted 
selection (see Wan et al., 2004; Kubo et al., 2002 for more details on the popu- 
lation development). Each substitution line carried one to ten chromosomal seg- 
ments from the donor parent IR24, and on average, each IR24 segment was 
present in 3.7 substitution lines, and each substitution line has 4.6 segments of 
IR24. The two parents and 65 substitution lines were grown in multi-year and 
multi-location environments, and a number of agronomic and quality traits were 
investigated (Wan et al., 2006). 

In conventional bi-parental populations as discussed in chapters 2-6, the two parents
are also genotyped and phenotyped, but they are never included in the genetic
analysis. As can be seen in figure 9.3, genetic analysis of the CSSL population
focuses on the comparison of each substitution line with the background parent,
so the background parent has to be present. In a population containing
n ideal single-segment lines, it can be shown that the correlation
coefficient between marker variables, representing the chromosomal segments, is
$r = -\frac{1}{n-1}$. If the background parent is included, this correlation coefficient
becomes $r = -\frac{1}{n}$. Therefore, in the idealized population of single-segment lines,
the correlation between marker variables is low, which helps in accurately estimating
the effects of marker variables. However, the donor parent should not be included in the
genetic analysis. In fact, when the donor parent is included in a population of n ideal
single-segment lines, the correlation coefficient between marker variables becomes
$r = \frac{n-3}{2(n-1)}$, which is close to 0.5 when n is large. The greater the correlation
between marker variables, the more difficult it is to estimate the effects of marker
variables accurately (Wang et al., 2006, 2007).
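
The correlations quoted above can be checked numerically. The sketch below builds the marker variables of an idealized population of n single-segment lines (coded -1 for background and 1 for donor segments, as in the linear model of this section), optionally appends the background or the donor parent, and computes the Pearson correlation between the first two marker variables. The helper names and the choice n = 10 are assumptions made for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def first_two_markers(n, include_background=False, include_donor=False):
    """Marker variables of segments 1 and 2 in n ideal single-segment lines."""
    rows = [[1 if j == i else -1 for j in range(n)] for i in range(n)]
    if include_background:
        rows.append([-1] * n)   # background parent carries no donor segment
    if include_donor:
        rows.append([1] * n)    # donor parent carries every donor segment
    return [r[0] for r in rows], [r[1] for r in rows]

n = 10
r_lines_only = pearson(*first_two_markers(n))
r_with_background = pearson(*first_two_markers(n, include_background=True))
r_with_donor = pearson(*first_two_markers(n, include_donor=True))
print(r_lines_only, r_with_background, r_with_donor)
```

For n = 10 the three values agree with -1/(n - 1), -1/n, and (n - 3)/[2(n - 1)], respectively, which shows how strongly a single donor-parent record inflates the correlation between marker variables.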
In the following analysis, a population of n substitution lines is assumed, and
the linear model between phenotype and the segment genotype for each substitution
line is,

$$
y_i = b_0 + \sum_{j=1}^{t} b_j x_{ij} + e_i \tag{9.11}
$$

where i = 0 denotes the background parent; i = 1, 2, ..., n denotes each substitution
line; $b_0$ is the constant term in the linear regression model; $b_j$ (j = 1, ..., t) is the
regression coefficient of the jth segment; $x_{ij}$ is the indicator variable of the jth
chromosomal segment, represented by its marker, in line i, which is equal to -1 for the
background chromosomal segment and 1 for the donor chromosomal segment; and $e_i$ is the
residual effect, which follows a normal distribution with mean 0.

For the population given in figure 9.4, there are severe correlations between
marker variables in the linear model of equation 9.11, statistically known as
multi-collinearity. For example, M14 and M16, M26 and M27, M66 and M67, and M75
and M76 have correlation coefficients equal to 1. When multi-collinearity is severe,
the coefficients in equation 9.11 are difficult to estimate accurately. Statistically,
the severity of multi-collinearity can be measured by the variance inflation factor and
the condition number (Stuart et al., 1999), and the definition of the condition number is
presented here. For the 82 markers in figure 9.4, the marker correlation matrix can
be computed based on the indicator variables in the substitution line population,
and the maximum and minimum eigenvalues of the correlation matrix are denoted
by $\lambda_{max}$ and $\lambda_{min}$, respectively. The condition number k is defined as the ratio of
the maximum to the minimum eigenvalue, i.e., $k = \lambda_{max}/\lambda_{min}$. Experience shows that
multi-collinearity should be of concern when k exceeds 100; when k exceeds 1000,
multi-collinearity is severe, and the variance of the estimated coefficients in
equation 9.11 could be very high.

There are several statistical methods that can be adopted to reduce
multi-collinearity. However, it is difficult to give a definitive answer as to which
method is most appropriate for genetic analysis. Table 9.1 gives the results of
reducing the multi-collinearity of the population in figure 9.4 by gradually removing
marker variables. The specific procedure for this method is as follows: identify the two
markers with the highest correlation among all variables, compare the number of
substitution lines carrying each of them, delete the marker present in the larger number
of substitution lines, and repeat this process until the condition number is below a
predetermined criterion, e.g., 1000.
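
The condition number and the marker-removal procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce table 9.1: it relies on NumPy, treats a minimum eigenvalue below 1e-10 as zero (an infinite condition number), and deletes the second marker of the pair when both occur in the same number of lines. The small genotype matrix and the marker names are hypothetical.

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest eigenvalue of the marker correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eig = np.linalg.eigvalsh(corr)          # eigenvalues in ascending order
    if eig[0] <= 1e-10:                     # numerically singular matrix
        return float("inf")
    return float(eig[-1] / eig[0])

def prune_markers(X, names, threshold=1000.0):
    """Remove markers one at a time until the condition number is below threshold."""
    X = np.asarray(X, dtype=float)
    keep = list(range(X.shape[1]))
    deleted = []
    while len(keep) > 1 and condition_number(X[:, keep]) >= threshold:
        corr = np.corrcoef(X[:, keep], rowvar=False)
        np.fill_diagonal(corr, -np.inf)     # ignore self-correlations
        a, b = np.unravel_index(np.argmax(corr), corr.shape)
        # Number of lines carrying the donor segment at each of the two markers.
        count_a = int(np.sum(X[:, keep[a]] == 1))
        count_b = int(np.sum(X[:, keep[b]] == 1))
        drop = a if count_a > count_b else b
        deleted.append(names[keep[drop]])
        del keep[drop]
    return [names[k] for k in keep], deleted

# Hypothetical genotypes: rows are lines (first row is the background parent),
# columns are markers coded -1 (background) or 1 (donor).
X = [
    [-1, -1, -1, -1],
    [ 1,  1, -1, -1],
    [ 1,  1, -1, -1],
    [-1, -1,  1, -1],
    [-1, -1, -1,  1],
    [-1, -1,  1,  1],
]
names = ["M1", "M2", "M3", "M4"]
kept, deleted = prune_markers(X, names)
print(condition_number(X), kept, deleted)
```

In this toy example M1 and M2 are identical across lines, so the correlation matrix is singular and one of the two markers has to be removed before the remaining coefficients can be estimated.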

For the population given in figure 9.4, the number of QTLs is generally much
lower than the number of chromosomal segments. Stepwise regression can be used to
select the significant marker variables in the linear model of equation 9.11. In
stepwise regression, the probabilities for variables to enter and to be removed from the
model, i.e., PIN and POUT, need to be specified. When the number of variables is
small, PIN can be set to 0.05; when the number of variables is large, PIN needs to be
reduced to avoid the overfitting problem (see §5.6 in chapter 5 for handling the
overfitting issue in linear regression models). Coefficients of variables that do not
enter the linear model are set to 0. In the significance test for the jth marker,
TAB. 9.1 - Multi-collinearity among markers in the CSSL population derived from japonica
Asominori and indica IR24.

                 Two most highly correlated markers
Step  Condition  First   Number of     Second  Number of     Correlation  Marker
      number     marker  substitution  marker  substitution  coefficient  deleted
                         lines                 lines
1     Infinite   M14     3             M16     3             1            M16
2     Infinite   M26     2             M27     2             1            M27
3     Infinite   M66     2             M67     2             1            M67
4     Infinite   M75     3             M76     3             1            M76
5     Infinite   M60     4             M61     5             0.8872       M61
6     Infinite   M7      4             M8      3             0.8591       M7
7     Infinite   M37     4             M38     3             0.8591       M37
8     Infinite   M74     4             M75     3             0.8591       M74
9     Infinite   M48     8             M49     6             0.8515       M48
10    Infinite   M12     5             M13     7             0.8312       M13
11    Infinite   M4      2             M5      3             0.8101       M5
12    Infinite   M28     2             M29     3             0.8101       M29
13    Infinite   M52     3             M53     2             0.8101       M52
14    Infinite   M66     2             M68     3             0.8101       M68
15    Infinite   M72     3             M73     2             0.8101       M72
16    Infinite   M73     2             M75     3             0.8101       M75
17    Infinite   M57     3             M58     5             0.7622       M58
18    Infinite   M64     3             M65     5             0.7622       M65
19    Infinite   M22     7             M23     4             0.7374       M22
20    Infinite   M23     4             M24     4             0.7339       M24
21    6021       M31     5             M32     6             0.7062       M32
22    1819       M55     6             M56     5             0.7062       M55
23    1766       M19     1             M20     2             0.7016       M20
24    1725       M33     4             M34     2             0.6960       M33
25    1394       M2      6             M3      3             0.6901       M2
26    1340       M35     3             M36     6             0.6901       M36
27    1293       M14     3             M15     3             0.6508       M15
      758

Note: 758 at the end of the table indicates the condition number of the remaining
markers after removing the previous 27 markers. This condition number is below the empirical
value of 1000, so it is assumed that there is no severe multi-collinearity between the remaining
marker variables.


phenotype values are first adjusted by the results of stepwise regression,
i.e., equation 9.12, where $\hat{b}_k$ is the estimated value of the parameter $b_k$ in
equation 9.11. The purpose of the phenotypic adjustment is to exclude the effects of other
markers, which is exactly the same as the background control in inclusive composite
interval mapping (ICIM) introduced in previous chapters.

$$
\Delta y_i = y_i - \sum_{k \neq j} \hat{b}_k x_{ik} \tag{9.12}
$$


Assume that one QTL is located on the chromosomal segment represented by the
jth marker. The two alleles in the background and donor parents are denoted by
q and Q, respectively, and the phenotypic distributions of the two homozygous QTL
genotypes qq and QQ are $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, respectively. Rearrange the order
of substitution lines so that the marker types are identical to the background parent
for $i = 0, 1, 2, \ldots, n_1$, and identical to the donor parent for $i = n_1 + 1, n_1 + 2, \ldots, n$.
Thus, $\Delta y_i$ follows distribution $N(\mu_1, \sigma^2)$ for $i = 0, 1, 2, \ldots, n_1$, and distribution
$N(\mu_2, \sigma^2)$ for $i = n_1 + 1, n_1 + 2, \ldots, n$.

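The two hypotheses to test the existence of QTL are $H_0: \mu_1 = \mu_2$ and
$H_A: \mu_1 \neq \mu_2$. Under the null hypothesis $H_0$, all $\Delta y_i$ follow the same distribution
$N(\mu_0, \sigma_0^2)$, and the maximum likelihood estimates of the distribution mean and
distribution variance are,

$$
\hat{\mu}_0 = \frac{1}{n+1} \sum_{i=0}^{n} \Delta y_i, \quad
\hat{\sigma}_0^2 = \frac{1}{n+1} \sum_{i=0}^{n} (\Delta y_i - \hat{\mu}_0)^2 \tag{9.13}
$$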


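Therefore, the maximum of the log-likelihood function under $H_0$ is,

$$
\max \ln L_0 = \sum_{i=0}^{n} \ln f(\Delta y_i; \hat{\mu}_0, \hat{\sigma}_0^2) \tag{9.14}
$$

where $f(\Delta y_i; \mu_0, \sigma_0^2)$ is the probability density function of the normal distribution
$N(\mu_0, \sigma_0^2)$. The maximum of the log-likelihood function under $H_A$ is,

$$
\max \ln L_A = \sum_{i=0}^{n_1} \ln f(\Delta y_i; \hat{\mu}_1, \hat{\sigma}^2)
 + \sum_{i=n_1+1}^{n} \ln f(\Delta y_i; \hat{\mu}_2, \hat{\sigma}^2) \tag{9.15}
$$

where

$$
\begin{aligned}
\hat{\mu}_1 &= \frac{1}{n_1+1} \sum_{i=0}^{n_1} \Delta y_i, \quad
\hat{\mu}_2 = \frac{1}{n-n_1} \sum_{i=n_1+1}^{n} \Delta y_i, \\
\hat{\sigma}^2 &= \frac{1}{n+1} \left[ \sum_{i=0}^{n_1} (\Delta y_i - \hat{\mu}_1)^2
 + \sum_{i=n_1+1}^{n} (\Delta y_i - \hat{\mu}_2)^2 \right]
\end{aligned}
\tag{9.16}
$$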
. i=0 =m +1 


The LRT statistic and LOD score can be calculated from equations 9.14 and
9.15, and then used in the significance test. The additive effect of the QTL can be
estimated from the relationships $\mu_1 = m - a$ and $\mu_2 = m + a$, i.e.,

$$
\hat{a} = \frac{1}{2} (\hat{\mu}_2 - \hat{\mu}_1) \tag{9.17}
$$
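
The test for one segment can be sketched in a few lines. The example below is a minimal illustration, assuming that the adjusted phenotypes are already available and that the LOD score is obtained from the LRT statistic as LOD = LRT / (2 ln 10). The function names and the eight adjusted phenotypes are hypothetical.

```python
from math import log, pi

def normal_loglik(values, mean, var):
    """Log-likelihood of the values under the normal distribution N(mean, var)."""
    return sum(-0.5 * log(2.0 * pi * var) - (v - mean) ** 2 / (2.0 * var) for v in values)

def segment_test(adjusted, has_donor):
    """Return (LRT, LOD, additive effect) for one chromosomal segment.

    adjusted  : adjusted phenotypes, with the background parent included
    has_donor : True where the line carries the donor segment at this marker
    """
    group1 = [y for y, d in zip(adjusted, has_donor) if not d]  # background marker type
    group2 = [y for y, d in zip(adjusted, has_donor) if d]      # donor marker type
    total = len(adjusted)

    # Null hypothesis: one common normal distribution.
    mu0 = sum(adjusted) / total
    var0 = sum((y - mu0) ** 2 for y in adjusted) / total
    # Alternative hypothesis: two means and a common variance.
    mu1 = sum(group1) / len(group1)
    mu2 = sum(group2) / len(group2)
    var = (sum((y - mu1) ** 2 for y in group1)
           + sum((y - mu2) ** 2 for y in group2)) / total

    loglik0 = normal_loglik(adjusted, mu0, var0)
    loglik_a = normal_loglik(group1, mu1, var) + normal_loglik(group2, mu2, var)
    lrt = 2.0 * (loglik_a - loglik0)
    lod = lrt / (2.0 * log(10.0))
    return lrt, lod, 0.5 * (mu2 - mu1)

adjusted = [5.30, 5.28, 5.25, 5.31, 5.27, 5.55, 5.60, 5.58]   # hypothetical values
has_donor = [False, False, False, False, False, True, True, True]
lrt, lod, additive = segment_test(adjusted, has_donor)
print(round(lod, 2), round(additive, 3))
```

With these hypothetical values, the lines carrying the donor segment have clearly longer grains, so the LOD score is well above a threshold such as 2.5 and the estimated additive effect is positive.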


Assume that p and q are the frequencies of background and donor segments in
the CSSL population, respectively, which can be calculated from the genotypic data.
The genetic variance of the QTL on the segment is,

$$
V_G = p \hat{\mu}_1^2 + q \hat{\mu}_2^2 - (p \hat{\mu}_1 + q \hat{\mu}_2)^2 = 4pq \hat{a}^2 \tag{9.18}
$$

Thus, the phenotypic variance explained (PVE) by the QTL is,

$$
PVE = \frac{4pq \hat{a}^2}{V_P} \tag{9.19}
$$

where $V_P$ is the phenotypic variance of the mapping population. It is easy to see from
equation 9.19 that PVE depends not only on the QTL effect but also on allelic
frequencies. Therefore, it is possible to have a situation where a QTL with a larger effect
has a lower PVE. Also, QTLs are not necessarily independent of each other, and the
variances of individual QTLs are generally not additive. In QTL analysis of actual
populations, the summation of variances or PVEs from different QTLs should be
avoided.
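
A small numerical example may help to see why a QTL with a larger effect can have a lower PVE. The sketch below evaluates PVE = 4pq a^2 / V_P for two hypothetical segments in a population of 66 records (65 substitution lines plus the background parent). The additive effects, the numbers of lines carrying each segment, and the phenotypic variance are illustrative assumptions.

```python
def pve_percent(a, q, v_p):
    """PVE (%) of one QTL; q is the frequency of the donor segment."""
    p = 1.0 - q
    return 100.0 * 4.0 * p * q * a ** 2 / v_p

v_p = 0.04                                          # hypothetical phenotypic variance
pve_large = pve_percent(a=0.30, q=1 / 66, v_p=v_p)  # large effect, carried by 1 line
pve_small = pve_percent(a=0.15, q=10 / 66, v_p=v_p) # smaller effect, carried by 10 lines
print(round(pve_large, 2), round(pve_small, 2))
```

In this hypothetical case the segment with half the additive effect explains roughly twice as much phenotypic variance, simply because it is carried by more lines.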


9.2.3 QTL Mapping for Grain Length in a CSSL
Population in Rice 


Phenotypic data with two replicates in eight environments show that the japonica
variety Asominori has shorter grains and the indica variety IR24 has longer grains
(table 9.2). The average grain length of the 65 substitution lines was close to that of
Asominori, but there were substitution lines with grains shorter than Asominori.
The longest grain length among the substitution lines was lower than or close to that of
IR24. ANOVA showed that the variance among substitution lines reached a highly
significant level in each environment, and the heritability of the replicated mean ranged
from 0.82 to 0.97.

When the LOD score threshold was set to 2.5, and the condition number between 
marker variables was below 1000, a total of 18 segments affecting grain length were 
detected, distributed on eight chromosomes. In table 9.3, one QTL affecting grain 


TAB. 9.2 - Grain length (mm) in japonica Asominori, indica IR24, and the substitution line
population in eight environments.

                                   Environment
                                   E1      E2      E3      E4      E5      E6      E7      E8
Asominori                          5.29    5.25    5.31    5.18    5.33    5.29    5.32    5.29
IR24                               5.91    5.91    5.98    5.85    5.85    5.95    5.98    5.85
Minimum                            4.88    4.88    4.94    4.83    4.91    4.97    4.93    4.94
Maximum                            5.83    5.81    5.83    5.82    5.80    5.87    5.83    5.84
Average                            5.27    5.28    5.36    5.20    5.29    5.32    5.36    5.28
Standard deviation                 0.20    0.18    0.19    0.20    0.20    0.20    0.19    0.19
Genetic variance                   0.0367  0.0282  0.0313  0.0311  0.0371  0.0362  0.0364  0.0348
Error variance                     0.0036  0.0102  0.0049  0.0137  0.0043  0.0029  0.0021  0.0031
Heritability of replicated mean    0.9537  0.8471  0.9269  0.8195  0.9456  0.9617  0.9716  0.9575

Note: The first two rows are the average of two replicated observations of the parents, and the
latter rows are the descriptive statistics of phenotypic distribution, and genetic parameters
estimated from the substitution line population.


TAB. 9.3 - Chromosomal segments affecting grain length in one CSSL population in rice.

Chr.  Segment  Statistic             E1      E2      E3      E4      E5      E6      E7      E8
1     M3       LOD score           7.08    8.00   12.91    7.66    4.90    6.58    6.26    2.95
               Additive effect    -0.16   -0.13   -0.17   -0.11   -0.15   -0.15   -0.15   -0.09
               PVE (%)            11.28    7.34   14.49    4.64    9.71    9.69   10.67    3.40
2     Miz      LOD score           0.04    2.49    3.74    0.00    0.06    0.02    0.02    0.35
               Additive effect    -0.01   -0.06   -0.08    0.00   -0.02   -0.01   -0.01    0.03
               PVE (%)             0.05    1.69    3.00    0.00    0.10    0.02    0.03    0.33
2     Mis      LOD score           2.38    0.11    0.48    5.97    3.87    2.63    2.15    0.30
               Additive effect    -0.09   -0.01   -0.03   -0.09   -0.13   -0.09   -0.08   -0.03
               PVE (%)             3.10    0.07    0.34    3.28    7.42    3.28    3.07    0.29
3     M23      LOD score          18.76   21.17   20.08   29.37   15.26   14.41   14.64   19.41
               Additive effect     0.29    0.24    0.22    0.29    0.28    0.22    0.24    0.27
               PVE (%)            47.60   33.19   30.42   43.96   43.42   28.91   34.41   42.61
3     Mos      LOD score           3.08    9.90    6.90    4.66    0.08    1.48    0.02    7.18
               Additive effect     0.17    0.25    0.19    0.14   -0.03    0.11    0.01    0.25
               PVE (%)             4.21    9.78    6.21    2.52    0.12    1.82    0.02    9.69
3     Məs      LOD score           3.30    3.09    0.53    2.83    2.03    3.24    0.74    0.95
               Additive effect    -0.13   -0.09   -0.03   -0.07   -0.11   -0.12   -0.06   -0.06
               PVE (%)             4.62    2.37    0.38    1.43    3.64    4.23    1.04    1.02
4     M30      LOD score           0.03    0.01    0.00    4.28    2.24    0.95    0.25    0.00
               Additive effect     0.01    0.00    0.00   -0.08   -0.10   -0.05   -0.03    0.00
               PVE (%)             0.03    0.01    0.00    2.27    4.03    1.14    0.34    0.00
4     M34      LOD score           7.26    8.69   12.80   15.62    9.75    7.65   10.20    8.28
               Additive effect    -0.20   -0.16   -0.21   -0.22   -0.27   -0.20   -0.25   -0.19
               PVE (%)            11.79    8.22   14.48   12.95   20.33   11.39   18.74   11.61
6     M42      LOD score           1.61    0.00    0.05    2.90    0.72    0.02    0.38    0.02
               Additive effect     0.05    0.00   -0.01    0.04   -0.04   -0.01   -0.02   -0.01
               PVE (%)             2.13    0.00    0.03    1.46    1.00    0.03    0.44    0.02
6     Mas      LOD score           0.04    6.67    6.62    0.22    0.55    0.27    0.65    2.71
               Additive effect     0.01   -0.07   -0.08    0.01   -0.03   -0.02   -0.03   -0.06
               PVE (%)             0.05    5.34    5.91    0.10    0.73    0.31    0.83    3.06
6     Mas      LOD score           0.33    6.58    7.67    0.07    0.04    0.12    0.06    1.36
               Additive effect     0.05    0.19    0.20   -0.01    0.02    0.03    0.02    0.10
               PVE (%)             0.42    5.67    6.83    0.03    0.07    0.14    0.08    1.47
7     M50      LOD score           0.05    0.06    4.20    0.04    0.00    0.06    0.00    0.19
               Additive effect    -0.01    0.01    0.07   -0.01    0.00   -0.01    0.00   -0.02
               PVE (%)             0.05    0.04    3.42    0.02    0.00    0.07    0.00    0.20
7     M51      LOD score           0.18    0.65    0.32    0.64    0.69    0.61    2.95    4.04
               Additive effect     0.02    0.03    0.02    0.03    0.05    0.04    0.10    0.10
               PVE (%)             0.22    0.46    0.22    0.30    1.17    0.72    4.42    4.85
11    M77      LOD score           1.98    2.08    0.56    3.48    1.25    2.37    0.38    2.21
               Additive effect    -0.10   -0.07   -0.04   -0.08   -0.09   -0.10   -0.04   -0.09
               PVE (%)             2.66    1.54    0.40    1.81    2.16    3.00    0.52    2.49
12    Məs      LOD score           2.68    0.09    0.26    4.46    4.25    1.79    1.64    0.09
               Additive effect    -0.07    0.01   -0.01   -0.06   -0.10   -0.05   -0.05    0.01
               PVE (%)             3.70    0.05    0.19    2.35    8.25    2.10    2.35    0.09
12    M80      LOD score           0.20    0.44    0.62    5.42    6.71    0.63    0.98    0.29
               Additive effect     0.04    0.04    0.05    0.15    0.31    0.07    0.09    0.04
               PVE (%)             0.25    0.30    0.43    3.01   14.22    0.75    1.36    0.31
12    M82      LOD score           0.05    0.00    0.04   11.59    7.30    1.15    0.37    0.34
               Additive effect    -0.01    0.00   -0.01   -0.15   -0.19   -0.05   -0.03   -0.03
               PVE (%)             0.07    0.00    0.03    8.16   15.71    1.30    0.51    0.36

Note: LOD scores above the threshold of 2.5 indicate a significant QTL in the corresponding
environment; the additive effects and phenotypic variances explained (PVE) are given for each
environment.

length was considered to be present on a segment when the LOD score in at least one
environment exceeded the threshold of 2.5. Three QTLs located in segments M3, M23, and
M34 had LOD scores above 2.5 in all eight environments. The QTL located in segment
M23 had the highest LOD score, and also explained the most phenotypic variance
in each environment. This QTL had LOD scores from 14.41 (environment E6) to
29.37 (environment E4) and explained 28.91% (environment E6) to 47.60% (environment
E1) of the phenotypic variance. The allele from IR24 increased
grain length in all environments. This QTL was subsequently fine-mapped and
cloned (Wan et al., 2006, 2008; also see §9.4 for more information about this QTL).
The QTL located in segment M3 had LOD scores from 2.95 (environment E8) to
12.91 (environment E3), and PVE values from 3.40% (environment E8) to 14.49%
(environment E3). The QTL located in segment M34 had LOD scores from 7.26
(environment E1) to 15.62 (environment E4), and PVE values from 8.22% (environment
E2) to 20.33% (environment E5). For both segments, the estimated additive effects
were negative in all environments (table 9.3), i.e., the alleles from IR24 reduced
grain length. The QTLs on the other segments only had LOD scores above the threshold
of 2.5 in some environments. They also had relatively smaller effects on grain length,
and explained a lower proportion of the phenotypic variance, compared with the QTLs
located in segments M3, M23, and M34. Interestingly, however, despite the small genetic effects,
some of the small-effect QTLs also had stable effects across environments. For 
example, two QTLs located on segments Mj) and Mjg reduced grain length in all 
environments, and two QTLs on segments Ms: and Map increased grain length in all 
environments (table 9.3). Major-effect QTLs are often much easier to select and 
exploit in breeding, and long-term selection may have fixed most major-effect genes. 
Further improvement in genetic gain may depend on those genes with smaller but 
stable effects. From this perspective, the detection of stably expressed small-effect 
QTLs and their genetic studies are also of great value to breeding applications. 


9.3 QTL Mapping in Genetic Populations of Multiple 
Parents Crossed with One Common Parent 


QTL mapping based on bi-parental populations has become a routine approach in 
genetic studies of complex traits in plants and animals. However, in most bi-parental 
populations, recombination has not had enough time to shuffle the genome into 
small fragments, and QTLs are generally located in large chromosomal regions. In 
addition, if two parents do not have polymorphism at one genetic locus, it is 
impossible to detect the gene present at this locus in this population (Li et al., 2010). 
In the past decade, there has been a great interest in using multiple parents to make 
crosses and develop populations in genetic studies, such as the four-parental and 
eight-parental populations introduced in chapter 8. With more genetic variation in 
multi-parental populations, we are able to investigate the genetic basis of plant 
traits more completely (Wang et al., 2011; Li et al., 2011). To study the genetic 
diversity of flowering time in maize, the Maize Diversity Organization 
(http://www.panzea.org) selected 25 maize inbred lines with a wide range of sources 
and crossed them with the common inbred parent B73 to produce a total of 25 bi-parental
pure-line populations, each consisting of about 200 recombinant inbred lines (RILs)
and suitable for linkage analysis. Together, they formed a population of approximately
5000 lines for association mapping. This design is called nested association
mapping (NAM), and the population generated by this design is called the NAM
population (Guo et al., 2010; Buckler et al., 2009; McMullen et al., 2009). This
section introduces the application of ICIM to NAM populations, which is
called joint inclusive composite interval mapping (JICIM); the number of
families is not limited to 25 (Li et al., 2011).


9.3.1 Generalized Linear Regression and Model Selection 


Suppose that there are F (F > 1) bi-parental RIL populations sharing one common
parent. The population size of each family is $n_f$ (f = 1, 2, ..., F), and the total size is
$N = \sum_{f=1}^{F} n_f$. Similar to ICIM, JICIM also has two steps (Li et al., 2011). In the
first step, the effect of each family and the effects of marker variables in each family
are estimated. The total number of parents is equal to F + 1, so each marker has
F + 1 levels (i.e., one common parent and the other F founder parents). The generalized
linear model (GLM) for phenotypic observations is,

$$
Y = b_0 + \alpha u + X \beta + e \tag{9.20}
$$

where Y is the vector of phenotypic values; $b_0$ is the intercept of the linear model;
$u^T = (u_1, u_2, \ldots, u_F)$ is the vector representing the population effect between each
parent and the common parent, and $\alpha$ is the N × F design matrix relating each
$u_f$ (f = 1, 2, ..., F) to Y; $\beta$ is the [(F + 1) × m] × 1 vector including the effects of
m markers each with (F + 1) levels, and X is the corresponding design matrix of marker
variables; and e is the vector of residual effects. To avoid
the overfitting problem, parameters in the linear model of equation 9.20 are
estimated using stepwise regression. If a marker variable does not enter the model, its
coefficient is set to zero.


9.3.2 Parameter Estimation and Hypothesis Testing 
in JICIM 


Based on the first step of model selection and coefficient estimation in equation 9.20
by stepwise regression, the second step of JICIM is to perform genome-wide scanning.
Assuming that the current marker interval is (k, k + 1), the phenotype of
individual i ($i = 1, 2, \ldots, n_f$) in family f (f = 1, 2, ..., F) is adjusted in order to
exclude the effects of QTLs outside the current scanning interval, i.e.,

$$
\Delta y_{if} = y_{if} - \alpha_{if} \hat{u}_f - \sum_{j \neq k, k+1} \hat{b}_{jf} x_{ijf} \tag{9.21}
$$


If one QTL (alleles are denoted as $Q_f$ and $Q_0$) is present at the current testing
position in the fth family, QTL genotype $Q_fQ_f$ follows normal distribution $N(\mu_f, \sigma_f^2)$,
$Q_0Q_0$ follows normal distribution $N(\mu_0, \sigma_f^2)$, and the adjusted phenotype
$\Delta y_{if}$ ($i = 1, 2, \ldots, n_f$) follows a mixture distribution of two normal distributions.
The proportions of the two component distributions depend on the recombination
frequencies between the QTL and its two flanking markers (table 9.4). The presence of a
QTL at the current scanning position can be tested by the following hypotheses.

$$
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \cdots = \mu_F = \mu_0 \\
H_A &: \text{at least one of } \mu_1, \mu_2, \ldots, \mu_F \text{ is not equal to } \mu_0
\end{aligned}
\tag{9.22}
$$


The log-likelihood function under the alternative hypothesis $H_A$ is,

$$
\ln L_A = \sum_{f=1}^{F} \sum_{j=1}^{4} \sum_{i \in S_j}
\ln \left[ p_{fj} \, \phi(\Delta y_{if}; \mu_f, \sigma_f^2)
 + (1 - p_{fj}) \, \phi(\Delta y_{if}; \mu_0, \sigma_f^2) \right] \tag{9.23}
$$

where $S_j$ represents the jth marker type group (j = 1, 2, 3, 4; table 9.4); $p_{fj}$ represents
the proportion of QTL genotype $Q_fQ_f$ in the jth marker group in the fth family; $\mu_f$
and $\mu_0$ are the distribution means of QTL genotypes $Q_fQ_f$ and $Q_0Q_0$, respectively; and
$\phi(\cdot; \mu, \sigma^2)$ represents the probability density of the normal distribution $N(\mu, \sigma^2)$.

The EM algorithm is used to estimate the F + 1 distribution means and the distribution
variances in equation 9.23, and the maximum likelihood estimates of
means and variances are represented by $\hat{\mu}_0, \hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_F$ and
$\hat{\sigma}_1^2, \hat{\sigma}_2^2, \ldots, \hat{\sigma}_F^2$, respectively. Then the additive effect of the QTL in
each family ($\hat{a}_f$) is given by,

$$
\hat{a}_f = \frac{1}{2} (\hat{\mu}_f - \hat{\mu}_0) \tag{9.24}
$$


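Under the null hypothesis $H_0$, all $\Delta y_{if}$ ($i = 1, \ldots, n_f$; $f = 1, \ldots, F$;
$N = \sum_{f=1}^{F} n_f$) follow the same normal distribution $N(\mu_0, \sigma_0^2)$. Maximum
likelihood estimates of the distribution mean and distribution variance are,

$$
\hat{\mu}_0 = \frac{1}{N} \sum_{f=1}^{F} \sum_{i=1}^{n_f} \Delta y_{if}, \quad
\hat{\sigma}_0^2 = \frac{1}{N} \sum_{f=1}^{F} \sum_{i=1}^{n_f} (\Delta y_{if} - \hat{\mu}_0)^2 \tag{9.25}
$$

The log-likelihood function under the null hypothesis $H_0$ is given in equation 9.26,
from which the maximum likelihood estimates of distribution parameters together
with the maximum of the likelihood function can be acquired.

$$
\ln L_0 = \sum_{f=1}^{F} \sum_{i=1}^{n_f} \ln \phi(\Delta y_{if}; \mu_0, \sigma_0^2) \tag{9.26}
$$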


The LRT statistic and LOD score are calculated from the maximum 
log-likelihoods under the two hypotheses, i.e., equations 9.23 and 9.26, and then 
used to test the presence of QTL at the current scanning position. 
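
The EM iteration for the mixture likelihood can be illustrated with a deliberately simplified sketch. It handles one family only, so the mean of the common-parent genotype is estimated within that family rather than jointly across all F families, and it assumes that the prior probability of genotype $Q_fQ_f$ for each line has already been derived from its flanking marker type (the proportions of table 9.4). All names, the simulated data, and the true means of 12 and 10 are hypothetical.

```python
import random
from math import exp, log, pi, sqrt

def normal_pdf(y, mu, var):
    """Density of the normal distribution N(mu, var) at y."""
    return exp(-(y - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def em_one_family(y, prior, n_iter=500, tol=1e-8):
    """EM estimates (mu_f, mu_0, var, loglik) of a two-component normal mixture.

    y     : adjusted phenotypes of the lines in one family
    prior : prior[i] is the probability that line i carries QfQf,
            given its flanking marker type
    """
    n = len(y)
    mean_y = sum(y) / n
    var = sum((v - mean_y) ** 2 for v in y) / n
    mu_f, mu_0 = mean_y + 0.5 * sqrt(var), mean_y - 0.5 * sqrt(var)  # crude start
    previous = None
    loglik = 0.0
    for _ in range(n_iter):
        # E-step: posterior probability of QfQf for each line.
        post, loglik = [], 0.0
        for yi, pr in zip(y, prior):
            d_f = pr * normal_pdf(yi, mu_f, var)
            d_0 = (1.0 - pr) * normal_pdf(yi, mu_0, var)
            post.append(d_f / (d_f + d_0))
            loglik += log(d_f + d_0)
        if previous is not None and abs(loglik - previous) < tol:
            break
        previous = loglik
        # M-step: weighted means and common within-family variance.
        weight_f = sum(post)
        mu_f = sum(w * yi for w, yi in zip(post, y)) / weight_f
        mu_0 = sum((1.0 - w) * yi for w, yi in zip(post, y)) / (n - weight_f)
        var = sum(w * (yi - mu_f) ** 2 + (1.0 - w) * (yi - mu_0) ** 2
                  for w, yi in zip(post, y)) / n
    return mu_f, mu_0, var, loglik   # loglik is from the last E-step

# Simulated family of 400 lines with four marker-type groups.
random.seed(1)
prior, y = [], []
for _ in range(400):
    pr = random.choice([0.95, 0.60, 0.40, 0.05])
    carries_qf = random.random() < pr
    prior.append(pr)
    y.append(random.gauss(12.0 if carries_qf else 10.0, 1.0))

mu_f, mu_0, var, loglik = em_one_family(y, prior)
additive = 0.5 * (mu_f - mu_0)
print(round(mu_f, 2), round(mu_0, 2), round(additive, 2))
```

In a joint analysis the same E-step would be run in every family, while the M-step for the common-parent mean would pool the lines from all families.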

For the F families generated by the NAM design, there are a total of F + 1
homozygous genotypes at each QTL. The phenotypic mean of the common parental
genotype is $\hat{\mu}_0$. The phenotypic means of the other F genotypes can be expressed as
$\hat{\mu}_0 + 2\hat{a}_1, \ldots, \hat{\mu}_0 + 2\hat{a}_F$, respectively, using the additive effects estimated by
equation 9.24. Frequencies of the various genotypes in the entire NAM population
TAB. 9.4 - Distribution of two QTL genotypes in the fth family under four groups of marker
type in the current mapping interval (k, k + 1).

Group of     Sample     Frequency of                   Frequency of QTL genotype    Distribution of $\Delta y_{if}$
marker type  size       marker type                    $Q_fQ_f$     $Q_0Q_0$
1            $n_{f1}$   $\frac{1}{2}(1 - r_{k,k+1})$   $p_{f1}$     $1 - p_{f1}$    $p_{f1} N(\mu_f, \sigma_f^2) + (1 - p_{f1}) N(\mu_0, \sigma_f^2)$
2            $n_{f2}$   $\frac{1}{2} r_{k,k+1}$        $p_{f2}$     $1 - p_{f2}$    $p_{f2} N(\mu_f, \sigma_f^2) + (1 - p_{f2}) N(\mu_0, \sigma_f^2)$
3            $n_{f3}$   $\frac{1}{2} r_{k,k+1}$        $p_{f3}$     $1 - p_{f3}$    $p_{f3} N(\mu_f, \sigma_f^2) + (1 - p_{f3}) N(\mu_0, \sigma_f^2)$
4            $n_{f4}$   $\frac{1}{2}(1 - r_{k,k+1})$   $p_{f4}$     $1 - p_{f4}$    $p_{f4} N(\mu_f, \sigma_f^2) + (1 - p_{f4}) N(\mu_0, \sigma_f^2)$

Note: $Q_f$ is the allele in the fth parent, and $Q_0$ is the allele in the common parent.
$p_{f1} = (1 - r_{kq})(1 - r_{q,k+1})/(1 - r_{k,k+1})$, $p_{f2} = (1 - r_{kq}) r_{q,k+1}/r_{k,k+1}$,
$p_{f3} = 1 - p_{f2}$, $p_{f4} = 1 - p_{f1}$, where $r_{kq}$, $r_{q,k+1}$ and $r_{k,k+1}$ represent the
recombination frequencies between marker k and QTL, between QTL and marker k + 1, and
between markers k and k + 1, respectively. $N(\mu_f, \sigma_f^2)$ and $N(\mu_0, \sigma_f^2)$ represent the
phenotypic distributions of the two QTL genotypes $Q_fQ_f$ and $Q_0Q_0$ in the fth family,
respectively. Assume that pure lines in each family are doubled haploids (DHs). For RILs, r in
the third column should be replaced with the accumulated recombination frequency during the
repeated selfing.


Lim me 
VNO 2N? 
NAM population is, 


are equal to respectively. Therefore, the genetic variance of one QTL in the 


F F 2 
= 5 Tir o y nf 
The proportion of phenotypic variance explained by each QTL is given by, 


F F m. 
= 2279 KG = (227-ı xar) 


PVE 
Vp 


(9.28) 


vrhere Vp — 4 4 Vp, is the total phenotypic variance, and Vp, is the phenotypic 
variance in the fth family. 
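
The calculation behind equation 9.28 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the common-parent genotype is taken to have frequency 1/2 and the genotype from the fth parent frequency n_f/(2N) (no segregation distortion), genotypic values are expressed as deviations from the mean of the common-parent genotype, and the function names, family sizes, additive effects, and phenotypic variance are all hypothetical.

```python
def nam_qtl_variance(family_sizes, additive_effects):
    """Genetic variance of one QTL over a whole NAM population.

    Assumptions (not taken from real data): the common-parent genotype has
    frequency 1/2, the genotype from parent f has frequency n_f / (2N), and
    genotypic values are deviations from the common-parent mean (0 and 2*a_f).
    """
    total = sum(family_sizes)
    freqs = [0.5] + [n / (2 * total) for n in family_sizes]
    values = [0.0] + [2 * a for a in additive_effects]
    mean = sum(p * x for p, x in zip(freqs, values))
    return sum(p * (x - mean) ** 2 for p, x in zip(freqs, values))


def nam_pve(family_sizes, additive_effects, phenotypic_variance):
    """Proportion of phenotypic variance explained by the QTL."""
    return nam_qtl_variance(family_sizes, additive_effects) / phenotypic_variance


# Hypothetical example: three families of equal size.
sizes = [100, 100, 100]
effects = [1.0, -1.0, 0.5]
print(nam_qtl_variance(sizes, effects))
print(nam_pve(sizes, effects, phenotypic_variance=10.0))
```

Computing the variance directly from genotype frequencies and values, rather than from the closed-form expression, makes it easy to check the two against each other.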


9.3.8 QTL Mapping for Flowering Time in an Arabidopsis 
NAM Population 


Four inbred-line parents in Arabidopsis, i.e., Landsberg erecta (Ler, N20), Kashmir
(Kas-2, N1264), Kondara (Kond, CS6175), and Antwerp (An-1, N944), were used
to generate three RIL families according to the following crossing scheme: (An-1,
N944) x (Ler, N20), (Ler, N20) x (Kas-2, N1264), and (Ler, N20) x (Kond,
CS6175). Ler was used as the male parent in the first family but as the female
parent in the other two families. Self-pollinated seeds from the F1 hybrids were
planted as F2 generations, with 120, 164, and 120 individual plants from the three
crosses, respectively, and then advanced to the F9 generation by single seed descent
to form three RIL families of different sizes. The three RIL families are
abbreviated as Ant-1, Kas-2, and Kond, respectively. Sixty-four, 77, and 75 markers
were screened in the three RIL families to construct the respective linkage maps,
with genome lengths of 371 cM, 441 cM, and 351 cM, respectively. There were 39
polymorphic markers common to the three families, allowing the integration
of the three individual maps into one consensus map (Brachi et al., 2010;
Ehrenreich et al., 2009; El-Lithy et al., 2006).

The LOD score threshold was determined to be 3.30 by permutation tests. 
A total of nine QTLs affecting flowering time were detected (table 9.5). There were 
two QTLs each on chromosomes 1 and 4, three QTLs on chromosome 5, and one 
each on chromosomes 2 and 3. Of the nine QTLs, one with the largest LOD score 
was located on chromosome 4, near marker FRI. qA1-1 was closer to its left marker,
CIW1; qA1-2, qA4-2, and qA5-2 were all closer to the right markers of their
intervals, i.e., markers SNP110, SNP199, and nga139, respectively.


TAB. 9.5 — QTL mapping results for flowering time in one NAM population consisting of
three RIL families in Arabidopsis.

QTL     Chr   Position   Left marker        Right marker       LOD      Additive effect (day)
name          (cM)       (position, cM)     (position, cM)     score    Ant-1    Kas-2    Kond
qA1-1   1     66.00      CIW1 (65.20)       F6D8.94 (69.40)    3.75     -0.55    1.19     -2.64
qA1-2   1     107.00     SNP157 (104.40)    SNP110 (107.80)    4.69     -0.12    1.32     1.00
qA2     2     0.00       msat2.5 (0.00)     msat2.5 (0.00)     3.05     0.01     0.38     2.87
qA3     3     0.00       SNP105 (0.00)      nga172 (2.90)      13.39    2.32     1.05     2.74
qA4-1   4     3.00       msat4.41 (0.00)    FRI (3.60)         43.23    0.82     -5.60    -11.24
qA4-2   4     49.00      SNP295 (46.60)     SNP199 (49.90)     8.51     0.06     3.90     2.02
qA5-1   5     19.00      SNP136 (17.30)     SNP358 (20.60)     13.67    1.50     2.78     3.48
qA5-2   5     33.00      SNP236 (32.10)     nga139 (33.50)     5.87     0.05     -0.99    -4.18
qA5-3   5     93.00      SNP101 (92.50)     SNP304 (93.80)     8.00     -1.12    2.76     -1.01

Note: The number in parentheses after the marker name is the position (cM) of the marker on
the integrated genetic linkage map; in the estimates of additive effects, bold indicates that the
QTL also reached the significance level from QTL mapping in individual RIL families.


9.4 Mendelization of Quantitative Trait Genes


The multi-factorial or polygenic hypothesis is the theoretical basis of classical
quantitative genetics. Where the hypothesis holds, phenotypic values of
quantitative traits approximately follow normal distributions, which are difficult to
divide into groups, let alone to fit to the common Mendelian segregation ratios.
To detect more genes affecting the traits of interest in genetic studies, two parents
with as much variation as possible are often selected for crossing when
developing genetic populations. In such a population, the variance in the
subpopulation consisting of the same genotype at one targeted QTL includes, in addition to
random error, the genetic variance from the segregation of many other QTLs. As a
result, none of the individual QTLs explains a high proportion of the phenotypic
variance, and therefore different QTL genotypes are difficult to distinguish from the
phenotypic distribution.

One QTL is assumed to have an additive effect a = 2 and a dominant effect
d = -2, i.e., the allele of higher trait value is recessive relative to the allele of lower
trait value. If there are many other QTLs in the F2 population in which the QTL is
present, the background genetic variance generated by these QTLs is denoted by
$V_G$, and the error variance is denoted by $V_\varepsilon$. The proportion of phenotypic variance
explained (PVE) by this QTL under the condition of no segregation distortion in the
F2 population is given by,

$$PVE_Q = \frac{\frac{1}{2}a^2 + \frac{1}{4}d^2}{\frac{1}{2}a^2 + \frac{1}{4}d^2 + V_G + V_\varepsilon} \qquad (9.29)$$

From the above equation, it is easy to see that if there are many other QTLs
segregating in the F2 population, these QTLs will produce a large background
variance, which will reduce the PVE of the targeted QTL. A large random error in
phenotypic evaluation will also reduce the PVE. For example,
when $V_G + V_\varepsilon = 10$, $PVE_Q = 0.23$, the phenotypic distribution is close to a normal
distribution, and the distributions corresponding to the three QTL genotypes cannot
be separated from the phenotypic values alone (figure 9.5A).

Additive and dominant effects of one QTL reflect the phenotypic difference
arising from different genotypes at this locus, which is generally difficult to change.
However, it is possible to construct some secondary genetic populations such that
only one particular QTL segregates, thereby greatly reducing the background
genetic variance $V_G$. In addition, the uniformity of environmental conditions in
which the genetic materials are phenotyped can be improved through suitable field
experimental designs, thereby reducing the variance of random error in phenotyping.
As can be seen from equation 9.29, the PVE of the QTL increases as the
variance component from the background QTLs decreases. When $V_G + V_\varepsilon = 4$,
$PVE_Q = 0.43$, and phenotypic values show one skewed distribution (figure 9.5B);
when $V_G + V_\varepsilon = 2$, $PVE_Q = 0.60$, and phenotypic values show one bimodal
distribution (figure 9.5C); and when $V_G + V_\varepsilon = 0.5$, $PVE_Q = 0.86$, and phenotypic
values show two distinct distributions (figure 9.5D). The additive and dominant
effects of the QTL do not change during this procedure, and the increase in PVE comes
from the decreased background genetic variance and random error variance.

FIG. 9.5 — Phenotypic distribution of different genotypes at one QTL segregating in the F2
population under four values of variance from the background QTLs and random errors.
Notes: (A) sum of background genetic variance and error variance is 10; (B) sum of
background genetic variance and error variance is 4; (C) sum of background genetic variance and
error variance is 2; (D) sum of background genetic variance and error variance is 0.5.
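
The four PVE values quoted for figure 9.5 follow directly from equation 9.29 and can be checked with a short Python sketch. The function name `pve_f2` and its argument names are illustrative only; the QTL variance term assumes an F2 population without segregation distortion, as stated in the text.

```python
def pve_f2(a, d, background_plus_error_variance):
    """PVE of one QTL in an F2 population (equation 9.29).

    a, d: additive and dominant effects of the QTL.
    background_plus_error_variance: sum of the background genetic
    variance and the random error variance.
    """
    qtl_variance = 0.5 * a ** 2 + 0.25 * d ** 2
    return qtl_variance / (qtl_variance + background_plus_error_variance)


# The four scenarios of figure 9.5 (A-D), with a = 2 and d = -2.
for panel, variance in zip("ABCD", (10, 4, 2, 0.5)):
    print(panel, variance, round(pve_f2(2, -2, variance), 2))
```

With a = 2 and d = -2 the QTL variance is 3, so the PVE is 3/13, 3/7, 3/5, and 3/3.5 in the four panels, which round to 0.23, 0.43, 0.60, and 0.86.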

For example, in an F2 population, if the phenotypic distribution in figure 9.5D is
observed, the traditional Mendelian method can be used to divide the individuals
into two phenotypic groups, i.e., low and high values, and to test the observed sample
sizes of the two phenotypic groups against the 3:1 segregation ratio. If the 3:1
ratio can be fitted for a quantitative trait, the observed phenotypic difference can be
explained by one dominant gene, similar to those traits investigated in Mendel's
hybridization experiment. In this case, the gene affecting the quantitative trait is
said to have been Mendelized. The procedure of Mendelization, fine mapping, and
cloning of quantitative trait genes is further illustrated below using one gene on rice
grain width as an example.
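
The 3:1 segregation ratio test mentioned above is a chi-square goodness-of-fit test with one degree of freedom. The following Python sketch shows one way to carry it out with the standard library only. The function name and the counts (152 low-value and 48 high-value individuals) are hypothetical, and the low-value group is taken as the dominant class because d = -a in the example of figure 9.5.

```python
import math


def chi_square_3_to_1(n_dominant, n_recessive):
    """Chi-square goodness-of-fit test of a 3:1 segregation ratio.

    Returns the chi-square statistic and its P value (1 degree of freedom).
    For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    total = n_dominant + n_recessive
    observed = (n_dominant, n_recessive)
    expected = (0.75 * total, 0.25 * total)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts of the low-value and high-value phenotypic groups.
chi2, p_value = chi_square_3_to_1(n_dominant=152, n_recessive=48)
print(chi2, p_value)
```

A P value above the chosen significance level (e.g., 0.05) means that the 3:1 ratio is not rejected for the observed group sizes.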


9.4.1 Preliminary Mapping of One QTL on Grain Width 
of Rice in One RIL Population 


The mapping population consisted of a set of recombinant inbred lines (RILs). The two
parents of the cross were the japonica variety Asominori and the indica variety
IR24, and 71 advanced-generation RILs obtained by repeated selfing
were used in the genotypic and phenotypic evaluation. In four environmental
conditions (denoted by E1-E4), Asominori had a grain width of about 2.70 mm,
which was wider; IR24 had a grain width of around 2.10 mm, which was narrower. The
phenotypic frequency distribution in figure 9.6 shows that the RIL population has
grain width from 2.00 to 3.00 mm under the four environmental conditions, which is
continuous without any obvious trend of grouping, but transgressive segregation
can be observed in both directions (Wan et al., 2004, 2005, 2006, 2008).

FIG. 9.6 — Phenotypic distribution of grain width in four environments in a population of 71
recombinant inbred lines from the cross between Asominori and IR24.

Seventy-one RILs were genotyped using 250 molecular markers that are
polymorphic between the two parents, and a linkage map of the 12 chromosomes was
constructed from the genotypic data, with a total map length of 1203 cM and an
average distance between neighboring markers of 5.05 cM. QTL mapping for grain
width in each environment was performed using the ICIM method, and the
probability of marker variables entering the linear model in stepwise regression was set at
0.001. LOD score profiles in the four environments are shown in figure 9.7. It can be
seen that there is a significant peak at similar positions on chromosome 5, i.e., at
22 cM, in the four environments, and the genetic effect has the same direction across
environments. This QTL was named qGW-5. The LOD score profile also has obvious
peaks on some other chromosomes, such as chromosomes 3, 8, 9, and 10 (figure 9.7).
However, these peaks are only present in one or a few environments. If the locations
of these peaks are also considered as QTLs, they are less environmentally stable.


9.4.2 Validation of the Grain Width QTL by Chromosomal 
Segment Substitution Lines 


qGW-5, previously identified in one RIL population across a number of
environments, was validated by the population of substitution lines as shown in figure 9.4.
Under the four environmental conditions, i.e., E5-E8, the background parent Asominori
had a grain width around 2.75 mm, i.e., a wider grain shape. The grain width of
the donor parent IR24 was around 2.45 mm, i.e., a narrower grain shape. The
phenotypic frequency distribution in figure 9.8 shows that the CSSL population had grain
width from 2.40 to 3.00 mm under the four environmental conditions, which is
continuous without obvious multimodality. However, since the genetic contribution
to each CSSL from the background parent was much higher than that from the
donor parent, the grain width of the CSSL population was mostly located around
the background parent Asominori.

FIG. 9.7 — LOD score profiles from one-dimensional scanning of QTLs for grain width in a
population of 71 recombinant inbred lines generated from the cross between Asominori and
IR24.

FIG. 9.8 — Frequency distribution of grain width in four environments for 65 CSSLs of IR24
with Asominori as the background parent.

QTL mapping was performed on grain width in each environment using the
method introduced in §9.2. Ignoring the multi-collinearity between markers, the
probability of marker variables entering the model in stepwise regression was set at
0.01. The LOD score of each segment in the four environments is shown in figure 9.9.
Clearly, the LOD score exceeds the threshold value of 2.5 in all environments on the
chromosomal segment represented by the 35th marker (marker name C263) where
qGW-5 was located, and the genetic effects in the four environments were estimated
in the same direction.

FIG. 9.9 — LOD scores from QTL mapping on grain width in four environments in 65
chromosomal segment substitution lines of IR24 with Asominori as the background parent.


Genotypes of two CSSLs and their average grain width in the four environments
are shown in figure 9.10. CSSL28 has the same genotype as the donor parent IR24 at
markers C263, R3166, R569, and R2289, and the other chromosomal segments are
from the background parent Asominori. The grain width of CSSL28 was
significantly lower than that of Asominori, and this difference was caused by the difference
in chromosomal segment C263-R2289. In other words, segment C263-R2289 of IR24
contains the gene which decreases the grain width, in contrast with the alternative
gene in the segment of Asominori. The genotype of CSSL29 at markers C263,
R3166, and R569 was the same as that of the donor parent IR24, and the other
chromosomal segments were from the background parent Asominori. The grain
width of CSSL29 was significantly lower than that of Asominori, and this difference
was caused by the difference in chromosomal segment C263-R569. That is to say, the
C263-R569 chromosomal segment from IR24 also has the gene that decreases the grain
width. CSSL28 and CSSL29 had nearly identical grain width, indicating that
the gene decreasing grain width may not be located on the chromosomal segment
defined by marker R2289. In the RIL population, qGW-5 was mapped at 22 cM on
chromosome 5, between markers R3166 and R569. Thus, the result obtained
from the RIL population has been further validated by the CSSL population.

[Figure panel labels: Average grain width (mm); Chromosomes 1~4; Genotype on chromosome 5;
Chromosomes 6~12; qGW-5 was mapped at 22 cM in RIL population]

FIG. 9.10 — Phenotypes and genotypes of background parent Asominori, donor parent IR24,
and two chromosomal segment substitution lines (i.e., CSSL28 and CSSL29) with narrower
grains. Notes: Chromosomal segments of parent Asominori are indicated by blank boxes and
those of parent IR24 are indicated by shaded boxes; the number below the marker name on
chromosome 5 is the position (cM) on the linkage map constructed from the RIL population;
marker R2289 has no data in the RIL population, so its position on the linkage map is not
given.


9.4.3 Mendelization of a Stable QTL on Grain Width


CSSL28 differs from the background parent Asominori only on the chromosomal
segment from marker C263 to marker R2289. If only one QTL for grain width is
present on the segment, only that QTL would segregate in the F2 population
derived from the cross between CSSL28 and the background parent, and
thus a population segregating at only one quantitative trait locus is developed. To
obtain shorter donor segments containing qGW-5, the F1 hybrid between CSSL28
and Asominori was backcrossed to Asominori for four succeeding generations.
During the backcrossing procedure, only individuals with narrow grains were
selected and backcrossed with Asominori to ensure that the donor segment carrying
qGW-5 was always maintained after backcrossing. Finally, the narrow-grain
individuals were selected for selfing in the BC4F1 population to produce a large
secondary F2 population. A total of 2171, 1248, 2465, and 897 secondary F2 individuals
were planted under four other environmental conditions (denoted by
E9-E12) for phenotyping. Grain width in the four environments had similar
phenotypic distributions, and the frequency distribution of the 2465 individuals under
E11 is shown in figure 9.11.

In the secondary F2 population shown in figure 9.11, the grain width of the 2465
plants can be clearly classified into two groups: narrow and wide. Narrow-grain
individuals were closer to CSSL28, wide-grain individuals were closer to Asominori,
and the F1 hybrid was closer to CSSL28. Thus, the wide-grain phenotype is
recessive relative to the narrow-grain phenotype; or equivalently, the narrow-grain
phenotype is dominant to the wide-grain phenotype. The frequency of the
wide-grain individuals was low, and the genotypic screening showed that none of
these individuals carried the donor chromosomal segment. The wide-grain
individuals in the secondary F2 population had identical genotypes to Asominori.
The frequency of the narrow-grain individuals was high, and the genotypic screening
indicated that these individuals carried the donor chromosomal segment from IR24.
The two phenotypes, i.e., narrow-grain and wide-grain, differed significantly from
the 3:1 segregation ratio. Further analysis indicated that there is one partial sterility gene
located near qGW-5, which was possibly the S31(t) gene that was previously
mapped at 20.9-24.7 cM on chromosome 5 (Zhao et al., 2009). The significant
deviation of the two phenotypes from the expected 3:1 segregation ratio is in fact
caused by the tight linkage between qGW-5 and the locus of S31(t).

FIG. 9.11 — Distribution of grain width in one secondary F2 population derived from the
narrow-grain chromosomal segment substitution line, i.e., CSSL28, and the background
parent.

For the F2 population in figure 9.11, grain width was converted to a qualitative
trait controlled by one single gene; the wide-grain allele comes from Asominori, and
the narrow-grain allele comes from IR24; the wide-grain allele is recessive in the F1
hybrid. This gene was later named gw-5, and the wide-grain and narrow-grain
alleles were denoted by gw-5 and Gw-5, respectively. The homozygous genotype
gw-5gw-5 has wide grains, and both the heterozygous genotype Gw-5gw-5 and the
homozygous genotype Gw-5Gw-5 have narrow grains.


9.4.4 Fine Mapping and Functional Analysis of the Gene 
at a Stable Grain Width QTL 


On the chromosomal segment from marker Y1060L to marker R569 with a length of
10.6 cM where qGW-5 is located (figure 9.12A), additional markers that differ
between Asominori and IR24 were developed (figure 9.12B). These markers were
used to genotype the BC4F2 population derived from CSSL28 and the background
parent, and then to estimate the recombination frequencies between these markers
and gw-5 (figure 9.12C). The results showed that three simple sequence repeat
(SSR) markers (i.e., RM3328, RM3322, and RM5874) and one expressed sequence
tag (EST) marker (i.e., C53703) were present on the left of gw-5, and one SSR
marker (i.e., RM5994) was present on the right of gw-5 (figure 9.12C).

Ten polymorphic SSR markers were designed using DNA sequences of the six
BAC/PAC overlapping clusters in the sequence of rice variety Nipponbare
(figure 9.12B, C), and then used to genotype the BC4F2 population. Genotypic data
showed that among the 805 homozygous genotypes of gw-5, 23 and 6 recombinants
occurred for RMw530 and RMw513, respectively, and thus gw-5 was mapped to a
chromosomal segment with a genetic distance of 1.8 cM. From the sequence data of
Nipponbare, it is known that the physical locations of RMw530 and RMw513 are at
5,309,078 bp and 5,358,806 bp on chromosome 5, respectively, and thus gw-5 was
fine-mapped to a DNA sequence segment with a length of 49.7 kb (figure 9.12C).
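
The genetic and physical lengths of the fine-mapped interval can be reproduced from the numbers given above. The sketch below rests on two assumptions that are not stated explicitly in the text: each of the 805 homozygous individuals contributes two gametes, so recombinants are counted over 2 x 805 gametes; and for such small values the map distance in cM is approximated by 100 times the recombination frequency. Variable names are illustrative.

```python
# Counts and positions quoted in the text.
n_homozygotes = 805
recombinants = {"RMw530": 23, "RMw513": 6}
position_bp = {"RMw530": 5_309_078, "RMw513": 5_358_806}

# Assumption: every homozygous individual carries two informative gametes.
n_gametes = 2 * n_homozygotes

# Recombination frequency between each flanking marker and gw-5.
r = {marker: count / n_gametes for marker, count in recombinants.items()}

# Assumption: for small r, map distance (cM) is approximately 100 * r.
distance_cM = {marker: 100 * value for marker, value in r.items()}
interval_cM = sum(distance_cM.values())

# Physical length of the interval between the two flanking markers.
interval_kb = (position_bp["RMw513"] - position_bp["RMw530"]) / 1000

print(distance_cM)
print(round(interval_cM, 1), "cM")
print(round(interval_kb, 1), "kb")
```

Under these assumptions the two marker distances are about 1.43 cM and 0.37 cM, in line with the values shown for these markers in figure 9.12C, and the interval is about 1.8 cM or 49.7 kb.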

Weng et al. (2008) further mapped the GW-5/qGW-5 locus within an
overlapped interval and found that one 1212-bp deletion present in wide-grain varieties
was associated with grain width. It was also verified that the deletion was selected at
high intensity to increase the grain yield during the artificial domestication and
breeding improvement in rice. Thereafter, Liu et al. (2017) found that a gene
encoding a calmodulin-binding protein, located upstream of the deletion region, significantly affects
the grain width in rice and is mainly expressed in the glumes during seed
development. This gene is the candidate gene for GW-5/qGW-5, and the 1212-bp deletion in
wide-grain varieties regulates grain size by regulating the expression of GW-5. These
studies provide deep insights into the function and biochemical pathway of the
major-effect gene for grain width, which is a typical quantitative trait, and also
provide one classical example of systematic genetic studies and gene dissection for
quantitative traits controlled by multiple genes.

[Figure: panel A, Chr. 5 linkage map (cM); panel B, BAC/PAC clones OSJNBa0029B02,
P0473H02, P0681D04, and B1140B01; panel C, Chr. 5 fine map with map distances (cM),
number of recombinants, and physical distances (kb); 3352.9 kb (3509948-6862857 bp)]

FIG. 9.12 — Fine mapping of the narrow-grain gene gw-5 in rice. Notes: (A) Position of
qGW-5 on chromosome 5 of rice detected in the RIL and CSSL populations; (B) Six
BAC/PAC overlapping clusters on the chromosome interval with a length of 10.6 cM where
qGW-5 is located; (C) Fine mapping of the narrow-grain gene gw-5 on a chromosomal
segment with a length of about 49.7 kb.

In the example mentioned above, it took more than ten years from the
preliminary QTL mapping to the complete understanding of how the QTL/gene
contributes to the final phenotype on grain width and grain weight in rice. The time
spent would be much longer if the population development for preliminary
QTL mapping were counted. Even so, we anticipate that, in the near future, more and
more QTLs or genes of quantitative traits will be Mendelized, fine-mapped,
isolated, cloned, and functionally analyzed. Such information not only strengthens our
understanding of the inheritance of quantitative traits but also helps to utilize new
biotechnological approaches, such as genome editing and molecular design, in the
more targeted improvement of quantitative traits in breeding.


9.5 Association Mapping in Natural Populations 


9.5.1 Linkage Disequilibrium is the Prerequisite of Gene 
Mapping 


For any genetic population, two (or more) loci are said to be in equilibrium if the joint
genotypic frequencies are equal to the product of genotypic frequencies at each locus;
otherwise, the two (or more) loci are said to be in disequilibrium. In probability
theory, two events A and B are defined to be mutually independent if the probability
for them to happen simultaneously is equal to the product of the two individual
probabilities of events A and B. If genotypes at locus A and locus B are considered to be two
probability events, and the joint genotype at both loci is considered to be the
simultaneous occurrence of two events, equilibrium between two loci has the same meaning
as the independence of two probability events. Therefore, when two loci are in
equilibrium, the two loci are sometimes said to be independent of each other. The
independence test based on the contingency table can be applied to test the significance of
disequilibrium between two or even more genetic loci (see exercises 9.7 and 9.8).
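
As an illustration of the contingency-table approach, the Python sketch below applies Pearson's chi-square test of independence to a 3 x 3 table of joint genotype counts (rows AA, Aa, aa; columns BB, Bb, bb). The function name and both tables of counts are hypothetical. The P value uses the closed form of the chi-square tail probability for 4 degrees of freedom, so the function is restricted to 3 x 3 tables.

```python
import math


def independence_test_3x3(counts):
    """Pearson chi-square test of independence for a 3 x 3 table.

    counts[i][j]: number of individuals with genotype i at locus A
    and genotype j at locus B. Degrees of freedom = (3-1)*(3-1) = 4,
    for which P(X > x) = exp(-x/2) * (1 + x/2).
    """
    if len(counts) != 3 or any(len(row) != 3 for row in counts):
        raise ValueError("a 3 x 3 table is required")
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.exp(-chi2 / 2.0) * (1.0 + chi2 / 2.0)
    return chi2, p_value


# Hypothetical counts: one table at equilibrium, one in disequilibrium.
at_equilibrium = [[25, 50, 25], [50, 100, 50], [25, 50, 25]]
in_disequilibrium = [[60, 30, 10], [30, 140, 30], [10, 30, 60]]
print(independence_test_3x3(at_equilibrium))
print(independence_test_3x3(in_disequilibrium))
```

The first table gives a chi-square value of zero because every joint count equals the product of its marginal totals divided by the sample size; the second gives a large value and a very small P value, indicating significant disequilibrium.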

Take two loci as an example, i.e., locus A (with two alleles denoted by A and
a) and locus B (with two alleles denoted by B and b). The fact that they are in
equilibrium means that the frequency of joint genotype AABB is equal to the
product of the frequency of genotype AA and the frequency of genotype BB, the frequency
of AABb is equal to the product of the frequency of AA and the frequency of Bb, and
so on (table 9.6). The joint genotypic frequencies as given in table 9.6 are also called
the frequencies at equilibrium. When in equilibrium, the three genotypes at locus B
have the same conditional frequencies under different genotypes at locus A
(table 9.6); likewise, the three genotypes at locus A have the same conditional
frequencies under different genotypes at locus B. In a population, if there is some
difference between the joint genotypic frequencies at two loci and the frequencies at
equilibrium as given in table 9.6, some degree of disequilibrium occurs between the
two loci. In a population in disequilibrium, the frequencies of joint genotypes cannot
be derived from the genotypic frequencies at individual loci; the three genotypes at
locus B will have different conditional frequencies under different genotypes at locus
A; and the three genotypes at locus A will likewise have different conditional
frequencies under different genotypes at locus B.


TAB. 9.6 — Joint genotypic frequencies at locus A and locus B, and the conditional genotypic
frequencies at locus B, when loci A and B are in equilibrium.

Genotype and            Genotype and frequency at locus B                    Conditional genotypic frequency at locus B
frequency at locus A    BB, $p_{BB}$      Bb, $p_{Bb}$      bb, $p_{bb}$      BB          Bb          bb
AA, $p_{AA}$            $p_{AA}p_{BB}$    $p_{AA}p_{Bb}$    $p_{AA}p_{bb}$    $p_{BB}$    $p_{Bb}$    $p_{bb}$
Aa, $p_{Aa}$            $p_{Aa}p_{BB}$    $p_{Aa}p_{Bb}$    $p_{Aa}p_{bb}$    $p_{BB}$    $p_{Bb}$    $p_{bb}$
aa, $p_{aa}$            $p_{aa}p_{BB}$    $p_{aa}p_{Bb}$    $p_{aa}p_{bb}$    $p_{BB}$    $p_{Bb}$    $p_{bb}$


In table 9.6, assume locus A is a molecular marker, and locus B affects one
phenotypic trait. When the population is grouped by marker types, the
subpopulations corresponding to the three marker types AA, Aa, and aa will have identical
phenotypic means, and therefore no association can be observed between the marker
and the trait, or between the marker and the gene affecting the trait. If
disequilibrium occurs between loci A and B, and the frequencies of joint genotypes are not
equal to the frequencies at equilibrium as given in table 9.6, the three genotypes at
locus B would have different conditional frequencies under different genotypes at
locus A. The subpopulations composed of different marker types would have
different phenotypic means, and there would be some association between the marker
and the trait or between the marker and the genes affecting the trait. Thus, when
using molecular markers to locate genes on phenotypic traits, disequilibrium
between the marker locus and the gene locus is the prerequisite. Only in the
presence of disequilibrium are we able to observe the association between the
marker and the phenotypic trait that is controlled by genes, and to detect and
manipulate the genes on a phenotypic trait by their closely linked molecular markers.
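
The argument above can be made concrete with a small Python sketch that computes the phenotypic mean of each marker class from the joint genotypic frequencies. The function name, the genotypic values at locus B, and both sets of joint frequencies are hypothetical; the second set represents the extreme case in which the marker genotype completely determines the genotype at locus B.

```python
def marker_class_means(joint_freqs, values_b):
    """Phenotypic means of the marker classes AA, Aa, and aa.

    joint_freqs[i][j]: frequency of the joint genotype with genotype i
    at marker locus A (AA, Aa, aa) and genotype j at locus B (BB, Bb, bb).
    values_b: genotypic values of BB, Bb, and bb.
    """
    means = []
    for row in joint_freqs:
        marker_freq = sum(row)
        # Conditional frequencies at locus B weight the genotypic values.
        means.append(sum(f * g for f, g in zip(row, values_b)) / marker_freq)
    return means


values_b = [2.0, 0.0, -2.0]            # hypothetical values of BB, Bb, bb
p_a = [0.25, 0.50, 0.25]               # genotype frequencies at locus A
p_b = [0.25, 0.50, 0.25]               # genotype frequencies at locus B

# Equilibrium: joint frequencies are products, as in table 9.6.
equilibrium = [[fa * fb for fb in p_b] for fa in p_a]
# Complete disequilibrium: marker genotype determines the genotype at B.
disequilibrium = [[0.25, 0.0, 0.0], [0.0, 0.50, 0.0], [0.0, 0.0, 0.25]]

print(marker_class_means(equilibrium, values_b))
print(marker_class_means(disequilibrium, values_b))
```

At equilibrium the three marker classes share the same mean, so the marker shows no association with the trait; under disequilibrium the class means differ.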

For populations under controlled pollination, as described in previous
chapters, when the recombination frequency between two loci is lower than 0.5, there
has to be disequilibrium, whether the populations are bi-parental or
multi-parental. Conversely, disequilibrium between two loci in such populations reflects the
genetic linkage between the two loci. Disequilibrium caused by the
genetic linkage between two loci is also called linkage disequilibrium (LD). As will be
seen below, random mating can reduce LD between two linked loci; admixture of
populations with different structures can cause disequilibrium between genetic loci,
as can many other factors such as selection and inter-genic interactions. In natural
populations from uncontrolled crosses, genetic linkage and disequilibrium therefore
do not necessarily imply each other. To determine whether the disequilibrium observed in a natural
population is caused by genetic linkage or other factors, it is often necessary to know
the origin or derivation process of the population, and the kinship and pedigree
relationships among individuals in the population.


9.5.2 Linkage Disequilibrium in Random Mating 
Populations 


Allelic frequencies and genotypic frequencies at one locus in a large random mating
and non-selected population follow the Hardy-Weinberg equilibrium law. In
populations at the Hardy-Weinberg equilibrium, genotypic frequencies at one locus can
be derived from allelic frequencies at the locus (Wang, 2017; Falconer and Mackay,
1996). If the frequencies of two alleles A and a at locus A are p and q, the frequencies of the three
genotypes AA, Aa, and aa are equal to $p^2$, $2pq$, and $q^2$, respectively. Replacing the
genotypic frequencies in table 9.6 with the expected frequencies at Hardy-Weinberg
equilibrium at locus A and locus B gives the results in table 9.7.

When investigating the disequilibrium between two loci from the joint genotypic
frequencies, a large number of joint genotypes have to be considered, which is
sometimes inconvenient, especially when multiple alleles at each locus are taken into
consideration. For two or more loci in random mating populations, frequencies of the
diploid genotypes can also be obtained, provided the frequencies of haploid gamete
types are known. In fact, random mating between individuals is equivalent to
the random combination between the female gametes and male gametes. Therefore,
the nine genotypic frequencies at equilibrium as given in table 9.7 can be
equivalently treated as the random combination of four gamete types AB, Ab, aB, and ab
at frequencies $p_Ap_B$, $p_Ap_b$, $p_ap_B$, and $p_ap_b$, respectively (the readers are invited to
confirm this for themselves), and the four frequencies are called the gamete-type
frequencies at equilibrium. Clearly, the presence of disequilibrium between the four
gamete types is exactly equivalent to that between the nine genotypes. Therefore, in
random mating populations, it is sometimes possible to focus only on the
disequilibrium between gametes, for ease of analysis.
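
Readers who prefer a numerical check of this equivalence may find the following Python sketch useful. It combines the four gamete types at random and compares the resulting nine genotypic frequencies with the products of Hardy-Weinberg frequencies at the two loci. The function names and the allele frequencies 0.6 and 0.3 are hypothetical choices made for illustration.

```python
from itertools import product


def gamete_frequencies(p_A, p_B):
    """Gamete-type frequencies at equilibrium for allele frequencies p_A, p_B."""
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    return {"AB": p_A * p_B, "Ab": p_A * p_b, "aB": p_a * p_B, "ab": p_a * p_b}


def random_union(gametes):
    """Randomly combine female and male gametes into joint genotypes."""
    genotypes = {}
    for (g1, f1), (g2, f2) in product(gametes.items(), repeat=2):
        locus_a = "".join(sorted(g1[0] + g2[0]))  # "AA", "Aa", or "aa"
        locus_b = "".join(sorted(g1[1] + g2[1]))  # "BB", "Bb", or "bb"
        key = locus_a + locus_b
        genotypes[key] = genotypes.get(key, 0.0) + f1 * f2
    return genotypes


def hardy_weinberg(p, upper, lower):
    """Genotypic frequencies at one locus under Hardy-Weinberg equilibrium."""
    q = 1.0 - p
    return {upper * 2: p * p, upper + lower: 2 * p * q, lower * 2: q * q}


p_A, p_B = 0.6, 0.3  # hypothetical allele frequencies
joint = random_union(gamete_frequencies(p_A, p_B))
expected = {ga + gb: fa * fb
            for ga, fa in hardy_weinberg(p_A, "A", "a").items()
            for gb, fb in hardy_weinberg(p_B, "B", "b").items()}
max_diff = max(abs(joint[k] - expected[k]) for k in expected)
print(max_diff)
```

The largest difference between the two sets of nine frequencies is zero up to floating-point rounding, which is the equivalence stated in the text.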


TAB. 9.7 — Joint genotypic frequencies at two equilibrium loci A and B, and the conditional
frequencies at locus B in random mating populations.

Genotype and            Genotype and frequency at locus B                          Conditional genotypic frequency at locus B
frequency at locus A    BB, $p_B^2$       Bb, $2p_Bp_b$       bb, $p_b^2$         BB         Bb           bb
AA, $p_A^2$             $p_A^2p_B^2$      $2p_A^2p_Bp_b$      $p_A^2p_b^2$        $p_B^2$    $2p_Bp_b$    $p_b^2$
Aa, $2p_Ap_a$           $2p_Ap_ap_B^2$    $4p_Ap_ap_Bp_b$     $2p_Ap_ap_b^2$      $p_B^2$    $2p_Bp_b$    $p_b^2$
aa, $p_a^2$             $p_a^2p_B^2$      $2p_a^2p_Bp_b$      $p_a^2p_b^2$        $p_B^2$    $2p_Bp_b$    $p_b^2$

Assume that the actual frequencies of gametes AB, Ab, aB, and ab (or the
theoretical frequencies obtained from the population development, see §3.4 for
examples) in one random mating population are u, s, t, and v, respectively
(table 9.8). Of course, the four frequencies have to satisfy the restriction that their
sum is equal to one. Clearly, the frequency of gene A at locus A is $p_A = u + s$, and
the frequency of gene B at locus B is $p_B = u + t$. Calculating the difference between
the frequency of gamete AB ($p_{AB} = u$) and the equilibrium frequency $p_Ap_B$ yields
equation 9.30, which measures the degree of disequilibrium for gamete type AB.
Similarly, equation 9.31 can be obtained for the degrees of disequilibrium for the other
three gamete types. The degrees of disequilibrium as defined in equations 9.30 and 9.31
are calculated from the gamete-type frequencies and are sometimes referred to as the
gametic disequilibrium (Falconer and Mackay, 1996). On the other hand, if the
degree of gametic disequilibrium D between loci A and B is known, the four
gamete-type frequencies can be expressed in terms of the frequencies at equilibrium
and the degree of disequilibrium, which are listed in the last row in table 9.8.


$$D_{AB} = p_{AB} - p_Ap_B = u - (u+s)(u+t) = u(1-u-s-t) - st = uv - st \qquad (9.30)$$


$$D_{Ab} = -(uv - st), \quad D_{aB} = -(uv - st), \quad D_{ab} = uv - st \qquad (9.31)$$


TAB. 9.8 — Gamete-type frequencies and the measurement of disequilibrium in random
mating populations.

Gamete type                            AB               Ab               aB               ab
Actual or observed frequency           $p_{AB} = u$     $p_{Ab} = s$     $p_{aB} = t$     $p_{ab} = v$
Frequency at equilibrium               $p_Ap_B$         $p_Ap_b$         $p_ap_B$         $p_ap_b$
Frequency at disequilibrium (same      $p_Ap_B + D$     $p_Ap_b - D$     $p_ap_B - D$     $p_ap_b + D$
as the actual or observed frequency)
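
Equations 9.30 and 9.31 and the last row of table 9.8 can be verified numerically with a short Python sketch. The function name and the gamete frequencies u = 0.4, s = 0.1, t = 0.1, and v = 0.4 are hypothetical values chosen for illustration.

```python
def gametic_disequilibrium(u, s, t, v):
    """Signed degrees of disequilibrium for gametes AB, Ab, aB, and ab.

    u, s, t, v: frequencies of gametes AB, Ab, aB, and ab (sum to one).
    """
    if abs(u + s + t + v - 1.0) > 1e-9:
        raise ValueError("gamete frequencies must sum to one")
    p_A, p_B = u + s, u + t
    d_AB = u - p_A * p_B  # algebraically equal to u*v - s*t
    return {"AB": d_AB, "Ab": -d_AB, "aB": -d_AB, "ab": d_AB}


u, s, t, v = 0.4, 0.1, 0.1, 0.4  # hypothetical gamete frequencies
d = gametic_disequilibrium(u, s, t, v)

# Last row of table 9.8: equilibrium frequency plus signed disequilibrium
# recovers the actual gamete frequency.
p_A, p_a, p_B, p_b = u + s, t + v, u + t, s + v
rebuilt = {
    "AB": p_A * p_B + d["AB"],
    "Ab": p_A * p_b + d["Ab"],
    "aB": p_a * p_B + d["aB"],
    "ab": p_a * p_b + d["ab"],
}
print(d)
print(rebuilt)
```

In this example uv - st = 0.15, and adding the signed disequilibrium to each equilibrium frequency returns the original values of u, s, t, and v.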


The allelic combinations in the four gametes in table 9.8 can be classified into
two linkage phases, i.e., coupling and repulsion, according to how the alleles are
combined. If AB and ab are referred to as linkage in coupling, then Ab and aB are
referred to as linkage in repulsion. In one bi-parental population, if the genotypes
of the two parents are AABB and aabb, AB and ab are the parental gamete types, and
Ab and aB are the recombinant gamete types. When two coupling gametes are combined,
the genotype thus formed is AB/ab, and its frequency under random union of gametes
is equal to 2uv (see table 9.10). When two repulsion gametes are combined, the
genotype thus formed is Ab/aB, and its frequency is equal to 2st. When the degree of
disequilibrium given in equations 9.30 and 9.31 is equal to 0, uv = st, so the two
linkage phases AB/ab and Ab/aB of the double heterozygous genotype AaBb have the same
frequency in the population, both equal to twice the product of the frequencies of
the four alleles A, a, B, and b.

From equations 9.30 and 9.31, it can be seen that the degrees of disequilibrium can
be either positive or negative for the four gametes, but have the same absolute value.
This common absolute value is referred to as the degree of linkage disequilibrium
between loci A and B, denoted by D and given in equation 9.32.

D = |uv - st|    (9.32)


When uv - st > 0, since the frequencies of gametes Ab and aB (last row in
table 9.8) cannot be negative, D cannot exceed the maximum given in
equation 9.33. When uv - st < 0, since the frequencies of gametes AB and ab (last
row in table 9.8) cannot be negative, D cannot exceed the maximum given in
equation 9.34. To facilitate the comparison of disequilibrium in different
populations, the ratio of D to the maximum degree of disequilibrium is sometimes
used to measure the degree of disequilibrium, i.e., equation 9.35, which is called
the relative degree of disequilibrium and takes values from 0 to 1.


D_max = min{p_A p_b, p_a p_B}    (9.33)

D_max = min{p_A p_B, p_a p_b}    (9.34)

D' = D / D_max    (9.35)
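
As an illustration of equations 9.32 to 9.35, the following Python sketch computes D, D_max, and the relative degree of disequilibrium D' from four gamete frequencies. The function name and the example inputs are assumptions made for this sketch and do not come from the original text.

```python
def relative_disequilibrium(u, s, t, v):
    """Return (D, D_max, D') for the gamete frequencies of AB, Ab, aB, ab."""
    p_A, p_B = u + s, u + t
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    signed = u * v - s * t
    if signed > 0:
        d_max = min(p_A * p_b, p_a * p_B)      # equation 9.33
    else:
        d_max = min(p_A * p_B, p_a * p_b)      # equation 9.34
    D = abs(signed)                            # equation 9.32
    d_prime = D / d_max if d_max > 0 else 0.0  # equation 9.35
    return D, d_max, d_prime


# Hypothetical examples: only coupling gametes present, then an intermediate case.
for freqs in [(0.5, 0.0, 0.0, 0.5), (0.4, 0.2, 0.1, 0.3), (0.2, 0.4, 0.3, 0.1)]:
    D, d_max, d_prime = relative_disequilibrium(*freqs)
    print(freqs, f"D = {D:.3f}, D_max = {d_max:.3f}, D' = {d_prime:.3f}")
```

The guard for D_max = 0 is a design choice of this sketch: it returns D' = 0 when a locus is monomorphic, a case the text does not discuss.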


For the four gametes in table 9.8, if X is used to denote the indicator variable at
locus A, and Y is used to denote the indicator variable at locus B, the disequilibrium
between the two loci then actually reflects the correlation between the indicator
variables X and Y (table 9.9).


TAB. 9.9 — Indicator variables for the four gamete types at two loci A and B, each with two 
alleles. 


Gamete type    Frequency    Locus A (X)    Locus B (Y)
AB             u             1              1
Ab             s             1             -1
aB             t            -1              1
ab             v            -1             -1


Based on probability theory, the expectation and variance of variable X can be
calculated as follows.


E(X) = u × 1 + s × 1 + t × (-1) + v × (-1) = (u + s) - (t + v) = p_A - p_a

V(X) = u × 1^2 + s × 1^2 + t × (-1)^2 + v × (-1)^2 - E^2(X)
     = (u + s + t + v) - E^2(X) = 1 - E^2(X)
     = (p_A + p_a)^2 - (p_A - p_a)^2 = 4 p_A p_a

Similarly, the expectation and variance of variable Y can be obtained, i.e.,


E(Y) = (u + t) - (s + v) = p_B - p_b,  V(Y) = 4 p_B p_b

The covariance between variables X and Y is calculated as follows.

Cov(X, Y) = u × 1 × 1 + s × 1 × (-1) + t × (-1) × 1 + v × (-1) × (-1) - E(X)E(Y)
          = u - s - t + v - (u - v)^2 + (s - t)^2
          = (u - u^2) - (s - s^2) - (t - t^2) + (v - v^2) + 2(uv - st)


Because u + s + t + v = 1, the first four terms also sum to 2(uv - st). Therefore,
the covariance between variables X and Y can be further written as follows, where D
is taken with its sign, i.e., D = uv - st.

Cov(X, Y) = 4(uv - st) = 4D


The correlation coefficient between variables X and Y is obtained as,


r = Cov(X, Y) / sqrt(V(X) V(Y)) = 4D / sqrt(4 p_A p_a × 4 p_B p_b) = D / sqrt(p_A p_a p_B p_b)
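
The derivation above can be verified numerically from the indicator coding in table 9.9. The following Python sketch is illustrative only: the function name and the gamete frequencies are hypothetical. It computes the moments of X and Y directly from their definitions and compares them with the closed forms 4 p_A p_a, 4 p_B p_b, 4D, and D / sqrt(p_A p_a p_B p_b).

```python
import math


def indicator_moments(u, s, t, v):
    """Moments of the indicator variables X (locus A) and Y (locus B) in table 9.9."""
    gametes = [(u, 1, 1), (s, 1, -1), (t, -1, 1), (v, -1, -1)]  # (frequency, X, Y)
    e_x = sum(f * x for f, x, _ in gametes)
    e_y = sum(f * y for f, _, y in gametes)
    var_x = sum(f * x * x for f, x, _ in gametes) - e_x ** 2
    var_y = sum(f * y * y for f, _, y in gametes) - e_y ** 2
    cov = sum(f * x * y for f, x, y in gametes) - e_x * e_y
    r = cov / math.sqrt(var_x * var_y)
    return var_x, var_y, cov, r


# Hypothetical gamete frequencies of AB, Ab, aB, ab.
u, s, t, v = 0.4, 0.2, 0.1, 0.3
p_A, p_B = u + s, u + t
p_a, p_b = 1 - p_A, 1 - p_B
D = u * v - s * t  # signed disequilibrium

var_x, var_y, cov, r = indicator_moments(u, s, t, v)
print(f"V(X) = {var_x:.4f}  vs 4 p_A p_a = {4 * p_A * p_a:.4f}")
print(f"V(Y) = {var_y:.4f}  vs 4 p_B p_b = {4 * p_B * p_b:.4f}")
print(f"Cov  = {cov:.4f}  vs 4D = {4 * D:.4f}")
print(f"r    = {r:.4f}  vs D / sqrt(p_A p_a p_B p_b) = "
      f"{D / math.sqrt(p_A * p_a * p_B * p_b):.4f}")
```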


Therefore, whether the degree of disequilibrium D is equal to zero is equivalent
to whether there is a correlation between variables X and Y. The correlation can be
either positive or negative. To ensure a positive degree of disequilibrium,
equation 9.36 gives another indicator, r^2, taking values from 0 to 1, which is more
frequently used to measure the degree of disequilibrium between pairs of loci in
actual populations (Devlin and Risch, 1995). As stated earlier, the χ^2 statistic in
the independence test using the contingency table can be applied to test the
equilibrium between two loci. The readers are invited to demonstrate for themselves
that the r^2 given in equation 9.36, when multiplied by the total sample size, is
actually equal to the χ^2 statistic in the independence test using the contingency
table (see exercise 9.9).

r^2 = D^2 / (p_A p_a p_B p_b)    (9.36)
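
The stated relationship between r^2 and the χ^2 statistic can be illustrated with a short Python sketch. The haplotype counts below are hypothetical (they are not the data of exercise 9.9), and the function name is an assumption of this sketch. The χ^2 statistic is computed directly as the sum of (observed - expected)^2 / expected over the four cells of the 2 × 2 table.

```python
def r2_and_chi2(n_AB, n_Ab, n_aB, n_ab):
    """Return (r^2, chi-square, n) for a 2 x 2 table of haplotype counts."""
    observed = (n_AB, n_Ab, n_aB, n_ab)
    n = sum(observed)
    u, s, t, v = (count / n for count in observed)
    p_A, p_B = u + s, u + t
    p_a, p_b = 1 - p_A, 1 - p_B
    D = u * v - s * t
    r2 = D * D / (p_A * p_a * p_B * p_b)                      # equation 9.36
    expected = (n * p_A * p_B, n * p_A * p_b, n * p_a * p_B, n * p_a * p_b)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return r2, chi2, n


# Hypothetical counts of haplotypes AB, Ab, aB, ab.
r2, chi2, n = r2_and_chi2(30, 20, 10, 40)
print(f"r^2 = {r2:.4f}, n * r^2 = {n * r2:.4f}, chi-square = {chi2:.4f}")
```

For these counts, n × r^2 and the χ^2 statistic coincide up to rounding error, which is the identity the text asks the reader to demonstrate in general.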


9.5.3 Factors Influencing Linkage Disequilibrium


As a matter of fact, all factors that can change the population structure at a
single locus also affect the degree of disequilibrium between two or more loci.
These factors include the mating system, mutation, migration, selection, and random
drift, which affect the allelic and genotypic frequencies at individual loci, as
well as the equilibrium relationship between different loci. In genetic studies,
especially when molecular markers are used to detect the genes controlling the
phenotypic traits of interest, it is expected that the disequilibrium between loci is
caused by genetic linkage, rather than by the other factors. Only then can the mapping
results be used for marker-assisted selection in breeding or for further studies of
gene fine mapping and cloning. The following is a brief description of how the degree
of disequilibrium is affected by random mating, admixture of populations with
different structures, and selection.

The four gametes as shown in table 9.8 combine randomly to produce the diploid 
individuals and the frequencies of various genotypes of diploids are given in the 
second column in table 9.10. The last four columns in table 9.10 give the frequencies 
of four gametes produced by each diploid genotype, where r is the recombination 
frequency between locus A and locus B. The dot-product of the column in which 
each gamete is located and the genotypic frequencies in column 2 (two corre- 
sponding elements are multiplied and then summed up for the ten diploid geno- 
types) gives the frequency of the corresponding gamete in the progeny population, 
which are listed in the second-last row in table 9.10. 


TAB. 9.10 — Calculation of the linkage disequilibrium in progeny population after one 
generation of random mating. 


Genotype    Frequency    Gametes produced
                         AB                Ab                aB                ab
AABB        u^2          1                 0                 0                 0
AABb        2us          0.5               0.5               0                 0
AAbb        s^2          0                 1                 0                 0
AaBB        2ut          0.5               0                 0.5               0
AB/ab       2uv          (1 - r)/2         r/2               r/2               (1 - r)/2
Ab/aB       2st          r/2               (1 - r)/2         (1 - r)/2         r/2
Aabb        2sv          0                 0.5               0                 0.5
aaBB        t^2          0                 0                 1                 0
aaBb        2tv          0                 0                 0.5               0.5
aabb        v^2          0                 0                 0                 1
Total       1            u - (uv - st)r    s + (uv - st)r    t + (uv - st)r    v - (uv - st)r
Degree of disequilibrium D_0(1 - r)        -D_0(1 - r)       -D_0(1 - r)       D_0(1 - r)


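The degree of linkage disequilibrium for gamete type AB in the progeny population
is calculated in equation 9.37, where D_0 = uv - st = u - p_A p_B is the degree of
disequilibrium in the parental population.


D_1 = [u - (uv - st)r] - p_A p_B = (u - p_A p_B) - (uv - st)r
    = D_0 - D_0 r = D_0(1 - r)    (9.37)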
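
The dot-product calculation described for table 9.10 can be sketched in Python as follows. This is an illustrative sketch, not the authors' software: the function name, the example frequencies, and the value r = 0.2 are hypothetical. Each row holds one diploid genotype with its frequency and the proportions of gametes AB, Ab, aB, and ab that it produces.

```python
def next_generation(u, s, t, v, r):
    """Gamete frequencies (AB, Ab, aB, ab) after one generation of random mating."""
    half, rec, non = 0.5, r / 2, (1 - r) / 2
    table = [                      # (genotype frequency, gametes AB, Ab, aB, ab)
        (u * u,     (1, 0, 0, 0)),            # AABB
        (2 * u * s, (half, half, 0, 0)),      # AABb
        (s * s,     (0, 1, 0, 0)),            # AAbb
        (2 * u * t, (half, 0, half, 0)),      # AaBB
        (2 * u * v, (non, rec, rec, non)),    # AB/ab (coupling)
        (2 * s * t, (rec, non, non, rec)),    # Ab/aB (repulsion)
        (2 * s * v, (0, half, 0, half)),      # Aabb
        (t * t,     (0, 0, 1, 0)),            # aaBB
        (2 * t * v, (0, 0, half, half)),      # aaBb
        (v * v,     (0, 0, 0, 1)),            # aabb
    ]
    return tuple(sum(freq * gam[k] for freq, gam in table) for k in range(4))


# Hypothetical parental gamete frequencies and recombination frequency.
u, s, t, v, r = 0.4, 0.2, 0.1, 0.3, 0.2
d0 = u * v - s * t
u1, s1, t1, v1 = next_generation(u, s, t, v, r)
d1 = u1 * v1 - s1 * t1
print(f"D_0 = {d0:.4f}, D_1 = {d1:.4f}, D_0(1 - r) = {d0 * (1 - r):.4f}")
```

The progeny frequencies agree with the "Total" row of table 9.10, and the disequilibrium computed from them agrees with equation 9.37.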


Let D_t denote the degree of linkage disequilibrium in the t-th generation of random
mating, which can be obtained by applying equation 9.37 repeatedly.

D_t = D_0(1 - r)^t    (9.38)


As can be seen from equation 9.38, random mating reduces the degree of linkage
disequilibrium, though the one-meiosis recombination frequency remains fixed from
generation to generation. The consequence is that after many generations of
random mating, the disequilibrium between linked loci may not be observed in
progeny populations of limited size, leaving the genetically linked loci at
equilibrium in the population. Recombination frequency can only affect the rate at
which the disequilibrium is reduced during random mating. For example, with
r = 0.5, after each generation of random mating, the disequilibrium is reduced
by half, and after a few generations of random mating, the degree of disequilibrium
D will converge to 0, and the disequilibrium between the two loci will no longer exist.
With r < 0.5, after each generation of random mating, the degree of disequilibrium is
multiplied by a factor of (1 - r). Even for a small recombination frequency, i.e.,
close linkage, after many generations of random mating, the degree of disequilibrium D
will also converge to 0, and the disequilibrium between linked genes ceases to exist.
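
To give a sense of the rates implied by equation 9.38, the following Python sketch computes D_t and the number of generations needed for the disequilibrium to fall to a given fraction of D_0. The function names, the threshold of 1%, and the chosen values of r are assumptions made for illustration only.

```python
import math


def disequilibrium_after(d0, r, t):
    """D_t = D_0 (1 - r)^t, equation 9.38."""
    return d0 * (1.0 - r) ** t


def generations_to_fraction(r, fraction):
    """Smallest t with (1 - r)^t <= fraction, for 0 < r <= 0.5 and 0 < fraction < 1.

    Solved as t >= ln(fraction) / ln(1 - r); exact boundary cases may be
    affected by floating-point rounding.
    """
    return math.ceil(math.log(fraction) / math.log(1.0 - r))


for r in (0.5, 0.1, 0.01):
    t = generations_to_fraction(r, 0.01)
    remaining = disequilibrium_after(1.0, r, t)
    print(f"r = {r}: {t} generations, D_t / D_0 = {remaining:.5f}")
```

The sketch shows the contrast described in the text: unlinked loci lose nearly all disequilibrium within a handful of generations, while closely linked loci need many more.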

Admixture of two populations at equilibrium but with different genetic structures
may produce disequilibrium between two loci in the mixture population;
admixture of two populations at disequilibrium may produce a mixture population
at equilibrium. For the two equilibrium populations shown in table 9.11, the 1:1
mixture population has a degree of disequilibrium D = 0.1225. However, this
disequilibrium is not necessarily related to the genetic linkage between locus A and
locus B. Regardless of the linkage relationship between locus A and locus B, and
regardless of the value of the recombination frequency in the case of linkage, the
disequilibrium of the 1:1 mixture population is always equal to 0.1225. Even if there
is no genetic linkage between locus A and locus B, disequilibrium in the mixture
population still exists after a few generations of random mating, since D is only
reduced by half after each generation of random mating. For the two disequilibrium
populations given in table 9.11, the degree of disequilibrium is D = 0 in the 1:1
mixture population. Therefore, the disequilibrium that occurred in populations I and II
cannot be seen in the mixture population. In the mixture population,
disequilibrium and genetic linkage are thus not always mutually indicative. Only when
the other factors are excluded can one say that the observed disequilibrium between
two loci indicates that the two loci are genetically linked.
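
The two admixture cases of table 9.11 can be reproduced with a few lines of Python. This is an illustrative sketch with hypothetical function names; the gamete frequencies are those given in table 9.11, and the mixing weight of 0.5 corresponds to the 1:1 mixture.

```python
def disequilibrium(freqs):
    """Signed D = uv - st for gamete frequencies (AB, Ab, aB, ab)."""
    u, s, t, v = freqs
    return u * v - s * t


def mix(pop1, pop2, weight=0.5):
    """Gamete frequencies of a mixture with proportion `weight` of pop1."""
    return tuple(weight * a + (1 - weight) * b for a, b in zip(pop1, pop2))


# Gamete frequencies (AB, Ab, aB, ab) from table 9.11.
eq_I, eq_II = (0.01, 0.09, 0.09, 0.81), (0.64, 0.16, 0.16, 0.04)
dis_I, dis_II = (0.4, 0.2, 0.1, 0.3), (0.2, 0.4, 0.3, 0.1)

d_eq_mix = disequilibrium(mix(eq_I, eq_II))
d_dis_mix = disequilibrium(mix(dis_I, dis_II))
print(f"mixture of equilibrium populations:    D = {d_eq_mix:.4f}")
print(f"mixture of disequilibrium populations: D = {d_dis_mix:.4f}")

# With unlinked loci (r = 0.5), D halves each generation (equation 9.38).
for t in range(1, 4):
    print(f"generation {t}: D = {d_eq_mix * 0.5 ** t:.4f}")
```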

TAB. 9.11 - Effect of population admixture on the degree of disequilibrium.


Population       Mixture of two equilibrium             Mixture of two disequilibrium
                 populations                            populations
                 AB      Ab      aB      ab      D        AB     Ab     aB     ab     D
Population I     0.01    0.09    0.09    0.81    0        0.4    0.2    0.1    0.3    0.1
Population II    0.64    0.16    0.16    0.04    0        0.2    0.4    0.3    0.1    -0.1
1:1 mixture      0.325   0.125   0.125   0.425   0.1225   0.3    0.3    0.2    0.2    0


Selection is one of the most important and effective factors in changing the
population structure. In addition to changing the frequencies of genes and genotypes
at an individual locus, selection simultaneously affects the degree of disequilibrium
between loci. In an equilibrium population, selection can produce disequilibrium
between loci. In a disequilibrium population, selection may also result in an
equilibrium population of progenies. For the equilibrium population shown in
table 9.12, assuming a fitness value of 1 for both gametes AB and ab, and a fitness
value of 0.1 for both Ab and aB, the degree of disequilibrium after selection is
D = 0.131. For the disequilibrium population shown in table 9.12, the degree of
disequilibrium is D = 0.15 before selection. Assuming a fitness value of 0.25 for both
gametes AB and ab, and a fitness value of 1 for both gametes Ab and aB, the degree
of disequilibrium of the gametes after selection is D = 0. Actually, the change in
disequilibrium caused by selection is ascribed to the changes in allelic frequencies
caused by selection.


TAB. 9.12 — Effect of selection on the degree of disequilibrium. 


                    Selection in one equilibrium             Selection in one disequilibrium
                    population                               population
                    AB      Ab      aB      ab      D        AB      Ab      aB      ab      D
Before selection    0.4     0.1     0.4     0.1     0        0.4     0.1     0.1     0.4     0.15
Gametic fitness     1       0.1     0.1     1                0.25    1       1       0.25
After selection     0.727   0.018   0.073   0.182   0.131    0.25    0.25    0.25    0.25    0
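
The values in table 9.12 can be reproduced by weighting each gamete frequency by its fitness and renormalizing so that the frequencies again sum to one. The Python sketch below assumes this simple gametic selection model, which is consistent with the numbers in the table; the function names are hypothetical.

```python
def disequilibrium(freqs):
    """Signed D = uv - st for gamete frequencies (AB, Ab, aB, ab)."""
    u, s, t, v = freqs
    return u * v - s * t


def select_gametes(freqs, fitness):
    """Gamete frequencies after selection: weight by fitness, then renormalize."""
    weighted = [f * w for f, w in zip(freqs, fitness)]
    total = sum(weighted)
    return tuple(x / total for x in weighted)


# Left half of table 9.12: equilibrium population, selection against Ab and aB.
after_eq = select_gametes((0.4, 0.1, 0.4, 0.1), (1, 0.1, 0.1, 1))
# Right half of table 9.12: disequilibrium population, selection against AB and ab.
after_dis = select_gametes((0.4, 0.1, 0.1, 0.4), (0.25, 1, 1, 0.25))

print([round(x, 3) for x in after_eq], f"D = {disequilibrium(after_eq):.3f}")
print([round(x, 3) for x in after_dis], f"D = {disequilibrium(after_dis):.3f}")
```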


9.5.4 Comparison of Linkage and Association Approaches
in Gene Mapping


In most cases, molecular markers are pieces of DNA sequence without particular 
functions on phenotypic traits. Genetic mapping is implemented by detecting the 
linkage relationship between molecular markers and the QTLs/genes affecting one 
specific trait. In bi-parental and multi-parental populations described in 
chapters 4-8, the linkage between markers and QTLs has to result in disequilibrium 
in those populations. That is to say, the QTL genotypes have different frequencies
in the sub-populations classified by marker types. Different phenotypic means of
the QTL genotypes will inevitably lead to differences between the means of the
sub-populations defined by marker types. Therefore, the
significant difference in phenotypic means from different marker types can be used to 
test whether there is a significant linkage relationship between the marker and the 
gene controlling the trait. This is actually the principle of single marker analysis 
introduced in §4.1, chapter 4. The mapping populations used in linkage analysis are 
progenies derived from pure-line parents showing a significant phenotypic difference. 
In these populations, the linkage between loci results in a higher proportion of 
parental types in the population than recombinant types; linkage disequilibrium can 
be easily observed in the population wherever the linkage is present. In addition, 
mapping populations from the controlled crosses have well-defined allelic
and genotypic frequencies, and the issue of population structure is absent.
The disequilibrium observed in these populations represents the genetic linkage
relationship between different loci, and disequilibrium and genetic linkage are 
mutually dependent. In human and animal genetic studies, linkage analysis is nor- 
mally based on nuclear families. In a population consisting of nuclear families, 
individuals are related to each other closely, and also have approximately the same 
degree of relationship or kinship. There is generally no population structure either, 
and the disequilibrium between loci and genetic linkage are mutually dependent. In 
these populations, the linkage distance between any two genetic loci can be esti- 
mated from the degree of disequilibrium, based on which the genetic linkage maps 
are built. 

However, the QTL linkage mapping approach also has its own limitations. One is 
that the number of QTLs that can be detected is rather limited. For example, when 
the mapping population is developed from two pure-line parents, as described in 
chapters 4-6, the linkage analysis involves only two alleles at each locus. If one QTL 
carries the same allele in both parents, it cannot be detected in the progeny popu- 
lation from the cross between these two parents. For the detected QTLs, there is no 
way to know whether multiple alleles are present at each QTL. This limitation is, of 
course, overcome to some extent by multi-parental populations described in 
chapters 7 and 8. Secondly, the resolution of QTL is low. When developing the 
mapping populations, the number of recombination events between loci is limited 
due to the limited number of crossing generations and the rapid fixation by repeated 
selfing. The precision of linkage analysis generally ranges from 10 to 20 cM.
Increasing the marker density helps to improve the QTL mapping precision, but the 
degree of improvement is limited by the size of the mapping population. In mapping 
populations with sizes below 200, after a certain level of marker density, such as one 
marker every 5-10 cM, adding more markers will not significantly improve the 
detection power and precision (Li et al., 2010; §10.3, chapter 10). Thirdly, QTL 
mapping results from specific populations and environments cannot be extended to 
populations from other crosses and environments. QTL detected in one genetic
background or environment may not be detected in another background or
environment due to the epistatic interactions between genes, as well as the interactions
between genes and environments. The same gene may also have different effects 
under different genetic backgrounds and environmental conditions. Therefore, QTL 
detected in a specific environment in one specific population is subject to validation 
by using other populations and environments. 

The underlying principle for the association mapping approach to locate QTLs is 
not fundamentally different from the linkage mapping approach. Both approaches 
utilize the association between a marker and phenotypic trait to detect the linkage 
relationship between the marker and genes/QTLs. However, the genetic populations 
used in association mapping are generally naturally pollinated, with more complex 
origins or relationships. In these populations, the long-term random mating previ- 
ously endured can obscure the linkage relationship between genetic loci. Even if 
there is a genetic linkage between two loci, disequilibrium is not always observed in 
the population. In addition, disequilibrium between two loci can be caused not only 
by genetic linkage but also by other factors such as the admixture of populations 
with different structures, and selection. In association mapping populations,
individuals always have unequal kinships, that is to say, some are more closely
related and some are less closely related, which indicates the presence of population
structure. Disequilibrium observed in association mapping populations does not
necessarily represent genetic linkage; there is no necessary causal relationship
between disequilibrium and genetic linkage.

The phenomenon shown in table 9.11, where the degree of disequilibrium in the
mixture population is inconsistent with that in the two component populations, is an
instance of Simpson’s paradox, a phenomenon described by the British statistician
E. H. Simpson in 1951. Simpson’s paradox refers to the phenomenon that two sets of
data yield contradictory conclusions when analyzed separately and jointly.
Assume that two alleles A and a affect flowering time in one plant species, and that
the genotypes and flowering times of two populations consisting of pure-line varieties
were investigated in two environments. Population I was planted in a short-flowering
environment; the sample sizes of genotypes AA and aa were 10 and 90, with mean
flowering times of 50 and 60 days, respectively. The flowering time of genotype AA
was therefore earlier than that of genotype aa. Population II was planted in a
long-flowering environment; the sample sizes of genotypes AA and aa were 40 and 60,
with mean flowering times of 80 and 85 days, respectively. The flowering time of
genotype AA was again earlier than that of genotype aa. If the two populations are
mixed, the mean flowering time of genotype AA is obtained as
(10 × 50 + 40 × 80)/(10 + 40) = 74 days, and the mean flowering time of genotype aa
is obtained as (90 × 60 + 60 × 85)/(90 + 60) = 70 days. In the mixture population,
genotype AA appears to flower later than genotype aa, which seemingly contradicts the
findings in the two populations (table 9.13).


TAB. 9.13 — Effect of population admixture on genetic analysis. 


Population       Sample size     Mean value of trait (d)     Deviation from mean (d)
                 AA      aa      AA      aa      Mean        AA       aa
Population I     10      90      50      60      59          -9       1
Population II    40      60      80      85      83          -3       2
Mixture          50      150     74      70      71          -4.2     1.4


Simpson’s paradox as shown in table 9.13 arises when the two genotypic frequencies
differ in the two populations, indicating that it is not appropriate to carry
out the simplified joint analysis. If the joint analysis must be conducted, the trait
values have to be adjusted first by population means. The data in table 9.13 show
that the mean values of flowering time in the two populations are 59 and 83 days,
respectively. In population I, the deviations from the mean value of 59 days are -9
days and 1 day for the two genotypes, respectively. In population II, the deviations
from the mean value of 83 days are -3 and 2 days for the two genotypes, respectively.
These deviations are referred to as genotypic effects in the following text.
Averaging the effect of genotype AA over both populations, weighted by sample size,
gives an effect of -4.2 days in the mixture population; averaging the effect of
genotype aa in the same way gives an effect of 1.4 days in the mixture population.
The joint analysis using the effects adjusted by population means gives the same
conclusion as the two populations, namely that genotype AA flowered 10 days earlier
than aa in population I, 5 days earlier in population II, and 5.6 days earlier when
the two populations were mixed. In table 9.13, Simpson’s paradox would not arise if
genotype AA had the same frequency, and genotype aa had the same frequency, in both
populations. Strictly speaking, populations with structural differences cannot be
mixed for genetic studies; if the joint analysis must be performed, the original
data have to be properly adjusted so as to avoid Simpson’s paradox.
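
The calculation behind table 9.13 can be sketched in Python as follows. The data are those of table 9.13; the dictionary layout and the function name are assumptions made for this sketch. The pooled mean of each genotype is computed twice: once from the raw trait values and once from deviations adjusted by the mean of the population each sample comes from.

```python
# (sample size, mean flowering time in days) for each genotype, from table 9.13.
populations = {
    "I":  {"AA": (10, 50.0), "aa": (90, 60.0)},
    "II": {"AA": (40, 80.0), "aa": (60, 85.0)},
}


def pooled_mean(genotype, adjust=False):
    """Sample-size-weighted mean of a genotype over both populations.

    With adjust=True, each value is first expressed as a deviation from
    the mean of its own population.
    """
    total_n, total = 0, 0.0
    for groups in populations.values():
        pop_n = sum(n for n, _ in groups.values())
        pop_mean = sum(n * m for n, m in groups.values()) / pop_n
        n, m = groups[genotype]
        total_n += n
        total += n * ((m - pop_mean) if adjust else m)
    return total / total_n


raw = {g: pooled_mean(g) for g in ("AA", "aa")}
adjusted = {g: pooled_mean(g, adjust=True) for g in ("AA", "aa")}
print("raw pooled means:      ", raw)       # AA appears later than aa
print("adjusted pooled effects:", adjusted)  # AA is earlier than aa
```

The raw pooled means reverse the direction of the genotypic difference, while the adjusted effects preserve it, in line with the discussion above.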

For the reasons mentioned above, when using association mapping in gene 
detection, on the one hand, a high density of markers has to be used in genotyping, 
from which the markers that are tightly linked to the genes are sought, with the 
expectation that the disequilibrium between these markers and genes has not been 
completely broken by random mating. On the other hand, structure analysis has to 
be performed on the mapping population to avoid the disequilibrium caused by 
population structure (Thomas, 2010; Yu et al., 2006; Hirschhorn and Daly, 2005). 
Currently, whole genome sequencing has been completed for a number of species. 
Based on the sequencing data, hundreds of thousands or even millions of single
nucleotide polymorphism (SNP) markers have been developed. Large-scale,
high-throughput screening of SNP markers in association mapping populations is
becoming increasingly realistic. However, efficient analysis of the structure hidden in
natural populations, and effective avoidance of the pseudo-genetic linkage due to
population structure, remain a challenge. Structure in an association mapping
population is often unknown, and how to accurately assess the population structure 
and classify the individuals into sub-populations with the help of statistical methods 
to eliminate the influence of structure on genetic analysis remains a matter of great 
concern in association mapping. In addition, disequilibrium is also affected by other 
factors such as selection and random drift, and how to avoid the disequilibrium 
caused by these factors is also important. Since multiple factors may occur in natural 
populations to make impacts on population structure and disequilibrium simulta- 
neously, how to sample the natural populations to create the most suitable popula- 
tions for genetic studies is particularly important in association mapping. 

Currently, genetic mapping based on natural populations is only referred to as 
association. Whether the association represents genetic linkage is subject to further 
validation. Most problems that occur with the linkage mapping approach also
apply to association mapping. The large amount of missing heritability
observed in human association mapping studies may also reflect the limitations
of association mapping from another aspect (Eichler et al., 2010;
Manolio et al., 2009; Maher, 2008). Therefore, one should not exaggerate the role of
association mapping in genetic studies, especially in genetic studies of plant species, 
where the controlled crosses can be more easily made, and the complicated issue of 
population structure can be properly avoided.


Exercises


9.1 Assume that a disease-resistant gene (two alleles at the locus are denoted by
R and S, respectively) is linked to a marker locus (two alleles are denoted by A and
B, respectively), and the recombination frequency between the gene and marker is
0.02. One cross is made between two parents RRAA and SSBB to produce an F2
population. Assume that genotypes of the F2 individuals are identifiable at the
disease-resistant locus.


(1) Calculate the frequencies of three marker types AA, AB, and BB, and the 
frequencies of alleles A and B, respectively, in the subpopulation consisting of 
the disease-resistant homozygous genotype RR, and subpopulation consisting of 
the disease-susceptible homozygous genotype SS. 

(2) Randomly select 5 individuals from the subpopulation of genotype RR, 
extract DNA and mix their DNA samples into the resistant pool. What is 
the probability of the presence of allele B in the resistant pool? What is the 
probability that allele B is present if 3 individuals are randomly selected 
to generate the resistant pool? What is the probability that allele B is 
present if only one individual is randomly selected to generate the resistant 
pool? 

(3) How will the above results change if the recombination frequency between the 
disease-resistant gene and the marker is only 0.01? Based on the results, what 
should be taken care of when using the bulked segregant analysis to detect the 
disease-resistant gene? 


9.2 Assuming that SSL1-SSL8 are eight single-segment lines of a donor chromosome,
the following table gives two coding methods on segments S1 and S2, one
method coding the two parental segments as 0 and 2, and the other one coding the
two parental segments as -1 and 1.


Segment   Background   SSL1   SSL2   SSL3   SSL4   SSL5   SSL6   SSL7   SSL8   Donor
          parent                                                               parent
S1        0            2      0      0      0      0      0      0      0      2
S2        0            0      2      0      0      0      0      0      0      2
S1        -1           1      -1     -1     -1     -1     -1     -1     -1     1
S2        -1           -1     1      -1     -1     -1     -1     -1     -1     1


(1) Calculate the correlation coefficient between S1 and S2 using the 8
single-segment lines in the table.

(2) Calculate the correlation coefficient between S1 and S2 using the 8
single-segment lines and the background parent in the table.

(3) Calculate the correlation coefficient between S1 and S2 using the 8
single-segment lines and the donor parent in the table.


9.3 The data from the survey of the MN blood type in two populations are shown in
the table below.


Blood type MM MN NN Total 


Population I 475 89 5 569 
Population II 233 385 129 747


(1) Calculate the allelic and genotypic frequencies in both 
populations. 

(2) Test whether the two populations are at Hardy-Weinberg equilibrium. 

(3) If the two populations are combined to form a mixture population, calculate the
allelic and genotypic frequencies in the mixture population and test for 
Hardy-Weinberg equilibrium. 


9.4 Suppose that there is no linkage between two loci, and two alleles at the two loci 
are denoted by A-a and B-b. An equal number of individuals having genotypes 
AABB and aabb are mixed, and the mixture population is called generation 0. 
Calculate the degree of disequilibrium between the two loci, and the theoretical 
frequency of the four gametes in the first and second generations of random mating. 
This exercise shows that for two genetic loci that are not linked, if disequilibrium
occurs in the initial population, the disequilibrium still exists after a few generations 
of random mating. 


9.5 Suppose the recombination frequency between two loci is r = 0.1, and two alleles at
the two loci are denoted by A-a and B-b. An equal number of individuals having 
genotypes AABB and aabb are mixed, and the mixture population is called gener- 
ation 0. Calculate the degree of disequilibrium between the two loci, and the theo- 
retical frequency of the four gametes in the first and 100th generations of random 
mating. This exercise shows that for two genetic loci that are linked, even if dise- 
quilibrium occurs in the initial population, the disequilibrium may disappear after 
many generations of random mating. 


9.6 Assume that two alleles at one locus are A and a, and the additive and dominant 
effects are equal to 3 and 2, respectively. Two alleles at the other independent 
genetic locus are B and b, and the additive and dominant effects are equal to 2 and 1, 
respectively. Random error variance on the phenotypic trait is equal to 4, and no 
other genetic factors are considered. 


(1) In the F2 population generated by crossing AABB and aabb as the two parents,
calculate the proportion of phenotypic variance explained by locus A and locus
B, respectively.

(2) Calculate the proportion of phenotypic variance explained by locus A in the F2
population generated by crossing AAbb and aabb as the two parents.


9.7 The observed sample sizes of nine genotypes at two loci in a population are
shown in the table below, where the row frequencies are the three genotypic
frequencies at locus A, and the column frequencies are the three genotypic frequencies
at locus B. The row and column frequencies are multiplied together to give the joint
genotypic frequencies at equilibrium (or when loci A and B are independent or not
linked). The joint genotypic frequencies at equilibrium, when multiplied by the total
sample size, give the expected sample sizes when the two loci are assumed to be at
equilibrium, and thus one χ^2 statistic can be calculated. The degree of freedom of the
χ^2 statistic is equal to (number of rows minus one) times (number of columns minus
one), which is equal to 4 in the 3 × 3 contingency table. This is the independence
test using the contingency table. Try to use this test to determine whether there is a
significant disequilibrium between the two loci.


Genotype BB Bb bb Row sum Row frequency 
AA 72 120 1 193 0.2503 

Aa 124 260 7 391 0.5071 

aa 2 3 182 187 0.2425 
Column sum 198 383 190 771 


Column frequency 0.2568 0.4968 0.2464 


9.8 The observed sample sizes of nine genotypes at two loci in a population are 
shown in the table below. Try to determine whether there is a significant disequi- 
librium between the two loci by independence test using the contingency table. 
Under the condition of equilibrium between the two loci, calculate the frequencies of 
the two linkage phases AB/ab and Ab/aB for the double heterozygous genotype 
AaBb in the population; calculate the frequencies of the four haplotypes AB, Ab, aB, 
and ab in the population. 


Genotype BB Bb bb Row sum Row frequency 
AA 18 39 15 72 0.144 

Aa 80 153 72 305 0.610 

aa 33 64 26 123 0.246 

Column sum 131 256 113 500 


Column frequency 0.262 0.512 0.226 


9.9 The observed sample sizes of four haplotypes at two loci in a population are shown
in the table below, where the row frequencies are the two allele frequencies at locus A,
and the column frequencies are the two allele frequencies at locus B. The row and
column frequencies are multiplied together to give the gamete-type frequencies at
equilibrium. The frequencies at equilibrium, when multiplied by the total sample size,
give the expected sample sizes at equilibrium, and thus one χ^2 statistic can be
calculated with one degree of freedom in the 2 × 2 contingency table.


(1) Try to determine whether there is a significant disequilibrium between the two
loci by the independence test in the contingency table.

(2) Calculate the gamete-phase linkage disequilibrium D. Validate that r^2 as given
by equation 9.36, when multiplied by the total sample size of 100, is equal to the
χ^2 statistic for the independence test in the contingency table.


Allele B b Row sum Row frequency 
A 23 44 67 0.67 

a 22 11 33 0.33 

Column sum 45 55 100 

Column frequency 0.45 0.55


9.10 Conduct QTL mapping using the selective genotyping analysis in a bi-parental 
population included in the QTL IciMapping software, and compare with the results 
from single marker analysis. 


9.11 Conduct QTL mapping in a population of chromosomal segment substitution 
lines included in the QTL IciMapping software, using the regression-based likelihood 
ratio approach. 


9.12 Conduct QTL mapping using JICIM in a NAM population included in the QTL 
IciMapping software. 


Chapter 10 


More on the Frequently Asked 
Questions in QTL Mapping 


QTL mapping has become one conventional approach in genetic studies on quan- 
titative traits, providing fundamental information on gene fine-mapping and 
map-based cloning. The mapping results also provide breeders with important 
genetic information on breeding targeted traits and the opportunity to trace and 
select the desirable genes by their closely linked molecular markers, and therefore 
improve the accuracy in the selection and the predictability in breeding. However, it 
is not trivial to conduct a meaningful and valuable QTL mapping study, as has been 
observed in previous chapters. Questions are frequently asked in QTL mapping, 
which can be classified into three categories: questions on statistical methods, 
estimation of genetic parameters, and mapping populations (Li et al., 2010). Some 
questions may have been mentioned and explained in detail, such as the choice of 
suitable LOD threshold, and comparison of mapping methods through simulation 
and power analysis. This last chapter of the book explains and discusses other 
questions that have been addressed only partly, or not at all, in previous 
chapters. 


10.1 Genetic Variance and Contribution to Phenotypic 
Variation of the Detected QTL 


10.1.1 Genetic Variance and Phenotypic Contribution 
from One QTL 


In one bi-parental population, frequencies of the three QTL genotypes QQ, Qq, and 
qq are represented by $f_{QQ}$, $f_{Qq}$, and $f_{qq}$, respectively. In QTL mapping, the three 
phenotypic means (or, equivalently, genotypic values) are estimated first, denoted 
by $\mu_{QQ}$, $\mu_{Qq}$, and $\mu_{qq}$, based on which additive and dominant effects of the QTL are 
calculated. The additive and dominant model establishes the linear relationship 
between phenotypic means and genetic effects, as given in equation 10.1. Genetic 
variance caused by the QTL (i.e., $V_G$) depends on the three phenotypic means and 
frequencies of the three genotypes in the population, i.e., equation 10.2. 


$$\mu_{QQ} = m + a, \quad \mu_{Qq} = m + d, \quad \mu_{qq} = m - a \tag{10.1}$$


$$V_G = f_{QQ}\mu_{QQ}^2 + f_{Qq}\mu_{Qq}^2 + f_{qq}\mu_{qq}^2 - (f_{QQ}\mu_{QQ} + f_{Qq}\mu_{Qq} + f_{qq}\mu_{qq})^2 \tag{10.2}$$


From equations 10.1 and 10.2, the relationship between genetic variance and 
genetic effects can be obtained, as given in equation 10.3. If only two homozygous 
genotypes, i.e., QQ and qq, are present in the population with frequencies $f_{QQ}$ 
and $f_{qq}$, respectively, genetic variance caused by the QTL is reduced to 
equation 10.4. 


$$V_G = [f_{QQ} + f_{qq} - (f_{QQ} - f_{qq})^2]a^2 - 2f_{Qq}(f_{QQ} - f_{qq})ad + (f_{Qq} - f_{Qq}^2)d^2 \tag{10.3}$$


$$V_G = 4f_{QQ}f_{qq}a^2 \tag{10.4}$$


Equations 10.2–10.4 give the genetic variance at one QTL. In an $F_2$ population 
without segregation distortion, we have $f_{QQ} = 0.25$, $f_{Qq} = 0.5$, $f_{qq} = 0.25$, and 
$V_G = \frac{1}{2}a^2 + \frac{1}{4}d^2$. In a DH or RIL population without segregation distortion, we 
have $f_{QQ} = 0.5$, $f_{Qq} = 0$, $f_{qq} = 0.5$, and $V_G = a^2$. When distortion occurs in 
populations with three genotypes at one locus, the multiplicative term of additive and 
dominant effects is included in genetic variance, as can be seen from equation 10.3. 
This term can be either positive or negative, making it difficult to judge the 
magnitude of genetic variance from genetic effects. The effects of segregation 
distortion on genetic variance and the QTL detection power will be further 
discussed in §10.5. 
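
The relationships in equations 10.1–10.3 can be checked numerically. The following Python sketch is only an illustration: the function names, the values of m, a, and d, and the distorted genotype frequencies are hypothetical and are not taken from any population in this book. It computes the genetic variance from genotypic means (equation 10.2) and from genetic effects (equation 10.3), and shows the sign of the cross term under one assumed pattern of distortion.

```python
def genetic_variance_from_means(freqs, means):
    """Equation 10.2: variance of genotypic means weighted by genotype frequencies."""
    mean = sum(f * mu for f, mu in zip(freqs, means))
    return sum(f * mu ** 2 for f, mu in zip(freqs, means)) - mean ** 2


def genetic_variance_from_effects(f_QQ, f_Qq, f_qq, a, d):
    """Equation 10.3: the same variance written in terms of a and d."""
    return ((f_QQ + f_qq - (f_QQ - f_qq) ** 2) * a ** 2
            - 2 * f_Qq * (f_QQ - f_qq) * a * d
            + (f_Qq - f_Qq ** 2) * d ** 2)


# Hypothetical mid-parent value and genetic effects (illustration only)
m, a, d = 10.0, 1.0, 0.5
means = (m + a, m + d, m - a)          # equation 10.1: QQ, Qq, qq

# F2 without distortion: expected V_G = a^2/2 + d^2/4
f2 = (0.25, 0.5, 0.25)
v_f2 = genetic_variance_from_means(f2, means)

# DH or RIL without distortion: expected V_G = a^2
dh = (0.5, 0.0, 0.5)
v_dh = genetic_variance_from_means(dh, means)

# Hypothetical distorted F2: the a*d cross term no longer vanishes
distorted = (0.40, 0.45, 0.15)
v_dist = genetic_variance_from_means(distorted, means)
cross_term = -2 * distorted[1] * (distorted[0] - distorted[2]) * a * d

print(f"F2: {v_f2:.4f}  DH/RIL: {v_dh:.4f}  distorted: {v_dist:.4f}")
print(f"cross term under distortion: {cross_term:.4f}")
```

With these assumed frequencies the cross term is negative because QQ is in excess and d has the same sign as a; reversing either condition would make it positive.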

The contribution of one QTL is defined by the proportion of genetic variance 
caused by the QTL in the phenotypic variance of the mapping population, which is 
also called the phenotypic contribution or phenotypic variance explained (PVE) by 
the QTL. The contribution or PVE of one QTL is normally represented by 
percentages, as defined in equation 10.5. 


$$\mathrm{PVE} = \frac{V_G}{V_P} \times 100\% \tag{10.5}$$

where $V_G$ is the genetic variance of the QTL as defined in equations 10.2–10.4, and 
$V_P$ is the phenotypic variance of the trait of interest in the mapping population. In 
populations without any segregation distortion, the genetic variance of one QTL 
depends only on its genetic effects. A QTL with larger effects causes larger 
variation and therefore has a higher PVE. When distortion is present, genetic variance 
depends on the genotypic frequencies as well, and a QTL with larger effects does not 
always have a higher PVE. 


10.1.2 Genetic Variance and Phenotypic Contribution 
of Linked QTLs 


When two or more QTLs are linked, or in disequilibrium, the joint genotypic 
frequencies cannot be simply derived from the frequencies at each locus. In this 
situation, the sum of PVEs from multiple QTLs or even the PVE of one QTL 
may exceed 100%. Two linked QTLs will be used here as an example to show 
this phenomenon, and no other genetic factors are considered. Assume that the 
two parental genotypes are $Q_1Q_1Q_2Q_2$ and $q_1q_1q_2q_2$, $a_1$ and $a_2$ are additive effects 
of the two QTLs, and r is the recombination frequency. In the bi-parental DH 
population, theoretical frequencies and trait values of the four homozygous 
genotypes are given in table 10.1. When considered independently, genetic 
variances of the two QTLs are $V_1 = a_1^2$ and $V_2 = a_2^2$, respectively. When the two 
QTLs are considered jointly, the total genetic variance in the population is given 
in equation 10.6, which is obviously not equal to the summation of variances of 
the two QTLs. 


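$$\begin{aligned}
V_G &= \tfrac{1}{2}(1 - r)(a_1 + a_2)^2 + \tfrac{1}{2}r(a_1 - a_2)^2 + \tfrac{1}{2}r(-a_1 + a_2)^2 + \tfrac{1}{2}(1 - r)(-a_1 - a_2)^2 \\
    &= a_1^2 + a_2^2 + 2(1 - 2r)a_1 a_2
\end{aligned} \tag{10.6}$$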


It can be seen from equation 10.6 that $V_G = V_1 + V_2$ only when r = 0.5, i.e., the 
two QTLs are not linked. For example, when $a_1 = 1.0$, $a_2 = 1.0$, and the error variance 
$V_\varepsilon = 0.4$, genetic variances of the two QTLs are $V_1 = V_2 = 1$. When r = 0.5, the total 
genetic variance is $V_G = 2$, phenotypic variance is $V_P = 2.4$, and heritability of the trait is 
equal to 0.833. PVE of each QTL is equal to 41.7%, and the sum of the two PVEs 
is equal to heritability, i.e., the proportion of the total genetic variance in 
phenotypic variance. When QTLs are not linked, directions of QTL effects will not affect 
the genetic variances of individual QTLs, total genetic variance, and PVEs of 
individual QTLs. 

TAB. 10.1 – Frequencies and trait values of four homozygous genotypes at two linked QTLs in 
bi-parental DH populations. 


Genotype at two QTLs   Theoretical frequency   Genotypic value
$Q_1Q_1Q_2Q_2$         $\frac{1}{2}(1 - r)$    $\mu_{11} = m + a_1 + a_2$
$Q_1Q_1q_2q_2$         $\frac{1}{2}r$          $\mu_{12} = m + a_1 - a_2$
$q_1q_1Q_2Q_2$         $\frac{1}{2}r$          $\mu_{21} = m - a_1 + a_2$
$q_1q_1q_2q_2$         $\frac{1}{2}(1 - r)$    $\mu_{22} = m - a_1 - a_2$


Notes: $a_1$ and $a_2$ are the additive effects of the two linked QTLs, and r is the recombination 
frequency. 


Figure 10.1 shows the scanning results from ICIM in one simulated DH population, 
where $a_1 = 1.0$ and $a_2 = 1.0$ in figure 10.1A, and $a_1 = 1.0$ and $a_2 = -1.0$ in 
figure 10.1B. The two QTLs are located on two chromosomes. At the two 
peaks on the LOD score profile, additive effects and PVEs are close to their true 
values, and the sum of the two PVEs is close to heritability in the broad sense. 

When two QTLs are linked, and their additive effects $a_1$ and $a_2$ are in 
opposite directions (i.e., the repulsive phase), total genetic variance can be much 
lower than the sum of the variances from individual QTLs, i.e., $V_G < V_1 + V_2$, which 
may cause the sum of PVEs from individual QTLs to exceed 100%. For 
example, when $a_1 = 1.0$, $a_2 = -1.0$, r = 0.1, and the error variance $V_\varepsilon = 0.4$, we have 
$V_1 = V_2 = 1$, $V_G = 0.4$, and $V_P = 0.8$. Therefore, PVE of each QTL has a theoretical value of 
125%. But total genetic variance is only 50% of the phenotypic variance, much lower 
than the PVEs of individual QTLs, and much lower than the sum of individual 
PVEs as well. 

When additive effects $a_1$ and $a_2$ are in the same direction (i.e., the coupling 
phase), total genetic variance can be much higher than the sum of variances from 
individual QTLs, i.e., $V_G > V_1 + V_2$, which may cause the sum of PVEs from 
individual QTLs to be much lower than the proportion of phenotypic variance that is 
genetic. For example, when $a_1 = 1.0$, $a_2 = 1.0$, r = 0.1, and the error variance 
$V_\varepsilon = 0.4$, we have $V_1 = V_2 = 1$, $V_G = 3.6$, and $V_P = 4$. Each QTL has a theoretical 
PVE of 25%. However, the total genetic variance is 90% of the phenotypic variance, 
much higher than the sum of individual PVEs. 
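
The three numerical examples in this section (unlinked, repulsive phase, and coupling phase) can be reproduced with a few lines of Python. This is a minimal sketch under the assumptions of equation 10.6 and table 10.1 (a DH population, additive effects only, no segregation distortion); the function names are illustrative and do not come from any software package mentioned in this book.

```python
def total_genetic_variance(a1, a2, r):
    """Total genetic variance over the four genotype classes of table 10.1."""
    freqs = (0.5 * (1 - r), 0.5 * r, 0.5 * r, 0.5 * (1 - r))
    values = (a1 + a2, a1 - a2, -a1 + a2, -a1 - a2)   # deviations from m
    mean = sum(f * g for f, g in zip(freqs, values))
    return sum(f * g ** 2 for f, g in zip(freqs, values)) - mean ** 2


def summarize(a1, a2, r, v_error):
    """PVE of each QTL and proportion of total genetic variance, in percent."""
    v1, v2 = a1 ** 2, a2 ** 2            # variances of the QTLs taken one at a time
    v_g = total_genetic_variance(a1, a2, r)
    v_p = v_g + v_error                  # phenotypic variance
    return {
        "V_G": v_g,
        "V_P": v_p,
        "PVE1": 100 * v1 / v_p,
        "PVE2": 100 * v2 / v_p,
        "genetic_share": 100 * v_g / v_p,
    }


unlinked = summarize(1.0, 1.0, 0.5, 0.4)
repulsive = summarize(1.0, -1.0, 0.1, 0.4)
coupling = summarize(1.0, 1.0, 0.1, 0.4)

for label, res in (("unlinked", unlinked), ("repulsive", repulsive), ("coupling", coupling)):
    print(label, {k: round(v, 2) for k, v in res.items()})
```

The enumeration over genotype classes gives the same value as the closed form $a_1^2 + a_2^2 + 2(1 - 2r)a_1 a_2$, so either can be used to anticipate how far individual PVEs will depart from the genetic share of phenotypic variance.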


[Figure 10.1. Panel A: two unlinked QTLs, additive effects = 1, error variance = 0.4. Panel B: two unlinked QTLs, additive effects = 1 and -1, error variance = 0.4. Horizontal axis: scanning of two chromosomes, each of 120 cM.]


FIG. 10.1 – QTL mapping in one simulated population with 200 DH lines and two unlinked 
QTLs. Notes: Two QTLs are located on two chromosomes each of 120 cM in length. Additive 
effects of the two QTLs and error variance are indicated in the figure. The mapping method is 
ICIM, and the step is 1 cM in one-dimensional scanning. 


For the two linkage phases mentioned previously, ICIM has no difficulty in 
detecting the two linked QTLs, and QTL positions and additive effects at two peaks on 
the LOD score profile are close to the pre-defined values (figure 10.2). In comparison 
with the independent QTL model as shown in figure 10.1, LOD scores and additive 
effects at peaks are similar, but PVEs differ considerably (figures 10.1 and 10.2). 


[Figure 10.2. Panel A: two linked QTLs, additive effects = 1, error variance = 0.4. Panel B: two linked QTLs, additive effects = 1 and -1, error variance = 0.4. Horizontal axis: scanning of two chromosomes, each of 120 cM.]


FIG. 10.2 – QTL mapping in one simulated population with 200 DH lines and two linked 
QTLs in the coupling phase (A) and repulsive phase (B). Notes: Two chromosomes are 
considered, each of 120 cM in length. Two QTLs are located on the first chromosome. The 
additive effects of the two QTLs and error variance are shown in the figure. The mapping 
method is ICIM, and the step is 1 cM in one-dimensional scanning. 


When the additive effects of two linked QTLs are both positive, PVEs at peaks are 
much lower (figure 10.2A); when additive effects of two linked QTLs are in 
opposite directions, PVEs at peaks are much higher (figure 10.2B). The major 
reason is that when QTLs are linked, total genetic variance is no longer equal to the 
sum of variances from individual loci. In other words, genetic variance is not additive 
for linked QTLs, and neither is PVE, which is normally calculated for each detected 
QTL. 

In addition to linkage, it should be noted that disequilibrium on genotypic fre- 
quencies at multiple loci, segregation distortion at an individual locus, and epistasis 
between QTLs can contribute to the non-additivity of genetic variances as well. 
Table 10.2 gives the joint frequencies and means of four homozygous genotypes at 
two QTLs, together with the marginal frequencies and phenotypic means. Marginal 
frequencies and phenotypic means calculated by row represent the genotypic fre- 
quencies and values at one QTL, and those calculated by column represent the 
genotypic frequency and values at the other QTL. Based on the marginal frequencies 
and phenotypic means, genetic variance of each QTL can be calculated. Based on 
the joint frequencies and phenotypic means, total genetic variance in the population 
can be calculated. Only in one particular situation, i.e., no linkage between QTLs and no 
segregation distortion, is the variance from the joint frequencies and phenotypic means 
equal to the sum of the two variances from marginal frequencies and means. 

TAB. 10.2 – Joint frequencies and phenotypic means of four homozygous genotypes at two 
QTLs, together with the marginal frequencies and phenotypic means calculated by rows and 
by columns. 


Genotype             $Q_2Q_2$                                            $q_2q_2$                                            Marginal frequency           Marginal mean
$Q_1Q_1$             $f_{11}$, $\mu_{11}$                                $f_{12}$, $\mu_{12}$                                $f_{1.} = f_{11} + f_{12}$   $f_{.1} \times \mu_{11} + f_{.2} \times \mu_{12}$
$q_1q_1$             $f_{21}$, $\mu_{21}$                                $f_{22}$, $\mu_{22}$                                $f_{2.} = f_{21} + f_{22}$   $f_{.1} \times \mu_{21} + f_{.2} \times \mu_{22}$
Marginal frequency   $f_{.1} = f_{11} + f_{21}$                          $f_{.2} = f_{12} + f_{22}$
Marginal mean        $f_{1.} \times \mu_{11} + f_{2.} \times \mu_{21}$   $f_{1.} \times \mu_{12} + f_{2.} \times \mu_{22}$


Under the particular situation of no linkage and no distortion, the joint 
genotypic frequency of each cell in table 10.2 is equal to the product of two corresponding 
marginal frequencies, such as $f_{11} = f_{1.} \times f_{.1}$. Namely, the frequency of $Q_1Q_1$ at the 
first QTL and frequency of $Q_2Q_2$ at the second QTL completely determine frequency 
of the joint genotype $Q_1Q_1Q_2Q_2$. Assuming there is no interaction between the two 
QTLs, the four genotypic values can be represented as the third column in 
table 10.1. Under the above situation, variance from the four joint frequencies and 
four genotypic values as given in table 10.2 is equal to the sum of two marginal 
variances. It should be noted that marginal frequencies have to be used in calculating 
marginal variance. From the previous analysis, it can be seen that the joint variance 
is additive to marginal variances only when the joint genotypic frequencies are equal 
to products of the corresponding marginal frequencies, and the additive and 
dominant genetic model applies to genotypic values. This is also the condition for the 
genetic variances and PVEs of individual QTLs to be additive. 
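
The comparison between joint and marginal variances can be sketched in Python as follows. The code assumes that each marginal mean is the average over the other locus weighted by that locus's marginal frequencies, which is how table 10.2 is read here; the function names and the effect sizes (1.0 and 0.5) are hypothetical. The population mean is set to 0 because it does not affect any variance.

```python
def weighted_variance(freqs, values):
    """Variance of values under the given frequencies."""
    mean = sum(p * x for p, x in zip(freqs, values))
    return sum(p * x ** 2 for p, x in zip(freqs, values)) - mean ** 2


def joint_variance(f, mu):
    """Total genetic variance over the four joint genotypes (2 x 2 tables)."""
    freqs = [f[i][j] for i in range(2) for j in range(2)]
    values = [mu[i][j] for i in range(2) for j in range(2)]
    return weighted_variance(freqs, values)


def marginal_variances(f, mu):
    """Variances of the two QTLs from marginal frequencies and marginal means."""
    row_f = [f[i][0] + f[i][1] for i in range(2)]          # first QTL
    col_f = [f[0][j] + f[1][j] for j in range(2)]          # second QTL
    row_mu = [sum(col_f[j] * mu[i][j] for j in range(2)) for i in range(2)]
    col_mu = [sum(row_f[i] * mu[i][j] for i in range(2)) for j in range(2)]
    return weighted_variance(row_f, row_mu), weighted_variance(col_f, col_mu)


a1, a2 = 1.0, 0.5                                          # hypothetical additive effects
mu = [[a1 + a2, a1 - a2], [-a1 + a2, -a1 - a2]]            # additive model, m = 0

equilibrium = [[0.25, 0.25], [0.25, 0.25]]                 # no linkage, no distortion
linked = [[0.45, 0.05], [0.05, 0.45]]                      # r = 0.1 in a DH population

v_joint_eq = joint_variance(equilibrium, mu)
v1_eq, v2_eq = marginal_variances(equilibrium, mu)
v_joint_linked = joint_variance(linked, mu)
v1_linked, v2_linked = marginal_variances(linked, mu)

print(v_joint_eq, v1_eq + v2_eq)              # equal at equilibrium
print(v_joint_linked, v1_linked + v2_linked)  # joint variance is larger in coupling
```

In the linked case the marginal frequencies are still 0.5 at each locus, so the two marginal variances are unchanged; only the joint variance responds to the disequilibrium.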

When two QTLs are linked in one DH population with no segregation distortion, 
two alleles at each locus have the same frequency at 0.5. But the joint frequency at the 
two loci is not equal to 0.25. Disequilibrium observed in table 10.1 is caused by genetic 
linkage. In population genetics, linkage is not the only factor in disequilibrium. 
Selection and random drift during the development of mapping populations can 
cause disequilibrium on genotypic frequencies (see §9.5.3 in chapter 9); epistatic 
interactions between alleles from different loci can cause the non-additivity on 
genotypic variances. In practical mapping populations, disequilibrium may occur on 
the joint genotypic frequencies, and at the same time epistasis may occur on the joint 
genotypic values, making it complicated to judge whether the total genetic variance 
is reduced or increased. 

Therefore, it may not be appropriate to sum up the PVEs on individual QTLs in 
QTL mapping studies. If a summed value is required, PVEs of individual QTLs should be 
adjusted by the genetic variance at the condition of equilibrium. The latest versions 
of software packages QTL IciMapping, GACD, and GAPL do the adjustment 
automatically, and then the adjusted PVEs can be summed and used as the 
estimate for the phenotypic variance explained by all detected QTLs. 


10.1.3 Phenotypic Contribution and the QTL Detection 
Power 


In theory, statistical power in hypothesis tests can be increased by increasing the 
sample size, by reducing the random error, or by both. In QTL mapping, this means 
a larger mapping population and smaller random error in phenotyping. Reducing 
the random error in phenotypic data can increase the trait heritability and PVEs of 
individual QTLs, and finally, increase the QTL detection power. Taking ICIM as an 
example, QTL detection power and false discovery rate (FDR) at different levels of 
PVE in RIL populations with different sizes are shown in figure 10.3 (Li et al., 2010). 
In populations with sizes 100, 200, and 400, detection powers in the 10 cM support 
interval are estimated at 29%, 67%, and 91%, respectively, for one QTL with 
PVE = 4%. The three powers are increased to 79%, 97%, and 100%, respectively, for 
the QTL with PVE = 10%. 

The reduction of random errors in phenotypic values can increase the contri- 
bution of individual QTLs indirectly in the population. When the contribution of 
one QTL is increased from 4% to 5%, the detection powers can be increased from 
29%, 67% and 91% to 44%, 77%, and 94% for population sizes 100, 200, and 400, 
respectively. Therefore, in QTL mapping studies, mapping populations should be as 
large as possible; random errors should be well-controlled in phenotyping trials. Of 
course, for those traits with significant genotype by environment interactions, 
phenotyping trials should be conducted in multiple locations and/or years as well. 


[Figure 10.3. Vertical axis: power and FDR (%). Horizontal axis: population size.]


FIG. 10.3 – Detection powers of QTLs at different levels of phenotypic variation explained 
(PVE), and the false discovery rate (FDR) in RIL populations at different sizes. 


The reduction of background genetic variation in the mapping population can 
also increase the PVE of the target QTL (see §9.4 in chapter 9), and therefore 
increase its detection power. The use of iso-genic lines and chromosomal segment 
substitution lines can completely control the background genetic variation, and 
therefore maximize the detection power of the QTL of interest. As an example, assume 
genetic variances of three independent QTLs, i.e., Q1, Q2, and Q3, are equal to 0.1, 
0.2, and 0.3, and the error variance is equal to 0.4 in one population. PVEs of the 
three QTLs are equal to 10%, 20%, and 30%, respectively, when all QTLs are 
segregating in the population. Under the same error variance, phenotypic variance is 
equal to 0.5 in the iso-genic population when only Q1 is segregating; equal to 0.6 
when only Q2 is segregating; and equal to 0.7 when only Q3 is segregating. 
Genetic effects and genetic variances of the three QTLs do not change, but due to 
the control of background genetic variance, their PVEs are increased to 20%, 33%, 
and 43%, respectively. Therefore, the detection powers are also increased if such 
populations are developed and used. It is worth mentioning that higher detection 
power is also associated with a narrower confidence interval on the detected QTL, 
which is helpful for further studies on QTL fine-mapping and map-based cloning. 
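
The arithmetic of this example can be written out in a short Python sketch. The variable names are illustrative only; the variances are those of the example above, and the QTLs are assumed to be independent so that their variances add.

```python
# Genetic variances of three independent QTLs and the error variance (from the example)
qtl_variances = {"Q1": 0.1, "Q2": 0.2, "Q3": 0.3}
error_variance = 0.4

# All three QTLs segregating: phenotypic variance is the sum of all components
v_p_full = sum(qtl_variances.values()) + error_variance
pve_full = {q: 100 * v / v_p_full for q, v in qtl_variances.items()}

# Iso-genic background: only one QTL segregates, so V_P = V_QTL + error variance
pve_iso = {q: 100 * v / (v + error_variance) for q, v in qtl_variances.items()}

for q in qtl_variances:
    print(f"{q}: PVE {pve_full[q]:.0f}% in the full population, "
          f"{pve_iso[q]:.0f}% in the iso-genic background")
```

The relative gain in PVE is largest for the QTL with the smallest variance, since it is the one most diluted by the background genetic variation.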


10.2 On the Use of Composite Traits in QTL Mapping 


10.2.1 Composite Traits and Their Applications 
in Genetic Studies and Breeding 


In most QTL mapping studies, phenotypic values are directly observed or measured 
on individuals or lines grown under specific environments such as the greenhouse or 
field. To seek stable QTLs across environments, the best linear unbiased estimates 
(BLUE) across multiple environments are occasionally used as well (see §6.4, 
chapter 6). In other studies, the phenotypic values of a trait of interest can be 
mathematically derived from some directly-measured traits, either by addition, 
subtraction, multiplication, or division. For convenience, traits having their own 
measurements and used directly in QTL mapping are called component traits or 
simply components; those mathematically derived from two or more component 
traits and then used in QTL mapping are called composite or derived traits (Wang 
et al., 2012; Li et al., 2010). 

Composite traits are often used in genetics and breeding. In maize, the 
anthesis-silking interval (ASI) is an important agronomic trait, closely related to 
grain yield, drought tolerance, and evolution of the species (Buckler et al., 2009; 
Messmer et al., 2009; Sari-Gorla et al., 1999; Bolanos and Edmeades, 1996; 
Ribaut et al., 1996). Phenotypic value on ASI from a single maize plant is defined as 
the difference between male flowering time (MFLW) and female flowering time 
(FFLW). Since the direct selection on drought tolerance per se is difficult in maize, 
ASI has been identified to be an efficient indicator, which is often used in selecting 
the drought tolerant lines in maize breeding (Ribaut and Ragot, 2007; Sari-Gorla 
et al., 1999). Ribaut et al. (1996) used 142 molecular markers to identify the genomic 
regions responsible for the expression of ASI in an $F_2$ population consisting of 234 
individuals, with the aim to develop marker-assisted selection strategies for 
improving drought tolerance (Ribaut and Ragot, 2007; Ribaut et al., 1997). 
Mapping results showed that four QTLs were common for MFLW and FFLW, one 
for ASI and MFLW, and four for ASI and FFLW. Two ASI-only QTLs were iden- 
tified, one on chromosome 2 (identified under the well-watered conditions) that 
explained 11.4% of the phenotypic variance, and one on chromosome 6 (identified 
under the severe stress conditions) that explained 13.0% of the phenotypic variance. 
Neither of the four common QTLs was found by MFLW and FFLW. In a population 
of 142 recombinant inbred lines (RILs), Sari-Gorla et al. (1999) identified five 
MFLW QTLs, zero FFLW QTLs, and seven ASI QTLs in the well-watered 
environments. The ASI QTL identified on maize chromosome 9 was not identified by 
MFLW and FFLW. In the water-stressed environments, four MFLW QTLs, two 
FFLW QTLs, and two ASI QTLs were identified. The ASI QTL identified on maize 
chromosome 5 was not identified by MFLW and FFLW. 

In rice, grain shape (GS) is an important grain quality trait defined by the ratio 
of grain length (GL) to grain width (GW) (Wan et al., 2005; Aluko et al., 2004; 
Li et al., 2004; Rabiei et al., 2004; Tan et al., 2000; Redona and Mackill, 1998). In an 
$F_2$ population consisting of 204 individuals and 116 molecular markers, Redona and 
Mackill (1998) identified seven QTLs for GL, four for GW, and three for GS. The 
three GS QTLs were located on chromosomes 3 and 7 that coincided with QTLs for 
GL and GW. In the $F_{2:3}$ and RIL populations derived from an elite hybrid rice 
cultivar, Tan et al. (2000) found that the major-effect QTLs for GL, GW, and GS 
were detected in both populations using paddy rice and brown rice, whereas the 
minor-effect QTLs were detected only occasionally. In a rice $BC_3F_1$ population 
consisting of 308 families, Li et al. (2004) identified two QTLs for GL located on 
chromosomes 3 and 10, and one QTL for GW located on chromosome 12. Two QTLs 
for GS were identified at similar chromosomal positions as the two GL QTLs. In an 
$F_2$ population consisting of 192 individuals, Rabiei et al. (2004) identified a total of 
18 QTLs, five for GL, seven for GW, and six for GS. Among the 18 QTLs, there was 
one major QTL specific for GS, i.e., not detected either by GL or by GW, explaining 
15% of the phenotypic variance in GS. 


10.2.2 QTL Mapping on Component and Composite Traits 
in One Maize RIL Population 


As indicated above, QTL mapping from composite traits sometimes shows a dis- 
crepancy with the results from their components. Composite-only QTLs, i.e., 
QTLs detected by composite trait but not by any component traits, have been 
occasionally reported (Wan et al., 2005; Rabiei et al., 2004; Tan et al., 2000; 
Sari-Gorla et al., 1999; Ribaut et al., 1996). Where did the composite-only QTLs 
come from? To what extent can we trust the composite-only QTLs and use this 
information in breeding or other genetic studies, such as QTL fine-mapping, gene 
cloning, and marker-assisted selection? One RIL population in maize is used below 
to further illustrate the different mapping results from component and composite 
traits. In the example population, it should be noted that some composite traits 
may not have any biological meaning. 

The population is one from the maize NAM design (Buckler et al., 2009), 
consisting of 187 RILs. The linkage map is based on 756 markers, which covers 
1380.8 cM of the ten chromosomes in maize, with an average distance of 1.85 cM 
between adjacent markers. Component trait I is the female flowering time 
(FFLW), and component trait II is the male flowering time (MFLW). The minimum, mean, 
and maximum phenotypic values (in days) are 73.44, 81.47, and 91.11 for trait I, 
and 72.50, 78.40, and 86.78 for trait II, respectively. The correlation coefficient 
between FFLW and MFLW is equal to 0.86 (figure 10.4). Phenotypic values of the four 
composite traits are obtained by addition, subtraction, multiplication, and division of the 
two component traits. 
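
The derivation of the four composite traits from two component traits can be sketched in Python as follows. The flowering days listed for five lines are made-up values for illustration, not data from the maize RIL population, and the function name is hypothetical. Subtraction and division are taken as trait I minus trait II and trait I over trait II, respectively, following the layout of table 10.4.

```python
def composite_traits(trait1, trait2):
    """Derive four composite traits, line by line, from two component traits."""
    pairs = list(zip(trait1, trait2))
    return {
        "addition": [x + y for x, y in pairs],
        "subtraction": [x - y for x, y in pairs],
        "multiplication": [x * y for x, y in pairs],
        "division": [x / y for x, y in pairs],
    }


# Hypothetical phenotypic values (days) of five lines
fflw = [75.0, 80.0, 82.5, 85.0, 90.0]   # component trait I, female flowering time
mflw = [73.0, 77.0, 79.0, 81.5, 86.0]   # component trait II, male flowering time

derived = composite_traits(fflw, mflw)
for name, values in derived.items():
    print(name, [round(v, 3) for v in values])
```

Each derived list can then be used as a phenotype in QTL mapping in exactly the same way as a directly measured trait.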


[Figure 10.4. Scatter plot of female flowering day (FFLW, vertical axis) against male flowering day (MFLW, horizontal axis), with fitted line y = 1.05x - 0.69 and R² = 0.74.]


FIG. 10.4 – Correlation between female and male flowering days in one RIL population in 
maize. 


For two component traits in the maize RIL population (i.e., FFLW and MFLW), 
11 additive QTLs are detected, distributed on eight of the ten maize 
chromosomes (table 10.3), denoted by qZ1–qZ11, where q stands for QTL and Z for the 
species name Zea mays L. qZ1 and qZ2 are located on chromosome 1; qZ3, qZ4, and 
qZ5 are located on chromosome 2; and the other six QTLs are located on different 
chromosomes. Seven QTLs are detected for each component trait, 
explaining 59.14% of the phenotypic variance for trait I, and 60.35% of the phenotypic 
variance for trait II. The total PVE from all detected QTLs on each component trait 
is estimated by the coefficient of determination in the regression of phenotype on 
flanking markers of the detected QTLs. Three unlinked QTLs (i.e., qZ4, qZ10, and 
qZ11) control two component traits simultaneously, and their effects are estimated 
in the same direction (table 10.3), which is understandable by considering the 
positive correlation between the two components (figure 10.4). In addition, qZ3, 
qZ6, and qZ11 each explain more than 10% of the phenotypic variance on 
component I; qZ4 and qZ11 each explain more than 10% of the phenotypic variance on 
component II. Therefore, qZ3, qZ4, qZ6, and qZ11 can be treated as four 
major-effect QTLs on two component traits (table 10.3). 

TAB. 10.3 – QTL mapping results from two component traits (i.e., FFLW and MFLW) and 
four composite traits in one maize RIL population. 


Trait                 QTL name   Chromosome   Position (cM)   LOD score   Additive effect   PVE (%)
Component I (FFLW)    qZ1        1            87              7.56        0.7496            8.05
                      qZ3        2            52              17.33       1.2165            20.47
                      qZ4        2            77              5.04        0.6483            5.10
                      qZ6        3            57              11.73       0.9558            12.99
                      qZ7        4            61              3.08        -0.4703           3.15
                      qZ10       7            46              4.73        -0.5835           4.78
                      qZ11       9            42              12.61       0.9959            14.23
Component II (MFLW)   qZ2        1            129             2.57        0.3561            2.68
                      qZ4        2            77              9.49        0.7615            10.24
                      qZ5        2            108             5.71        0.5704            5.62
                      qZ8        5            70              3.09        0.3875            2.93
                      qZ9        6            27              2.87        -0.3693           2.88
                      qZ10       7            49              4.77        -0.4757           4.73
                      qZ11       9            40              11.01       0.7396            11.56
Addition              qZ1        1            87              6.88        1.1885            6.53
                      qZ3        2            52              13.24       1.7202            13.21
                      qZ4        2            77              6.02        1.1830            5.49
                      qZ6        3            57              13.60       1.7287            13.72
                      qZ9        6            28              2.63        -0.7126           2.35
                      qZ10       7            46              6.15        -1.1128           5.62
                      qZ11       9            40              14.73       1.7997            14.98
                                 5            1               2.80        -0.7590           2.55
                                 5            98              7.42        1.2351            7.00
Subtraction           qZ1        1            87              4.21        0.3319            6.12
                      qZ8        5            69              6.53        -0.4414           10.08
                      qZ11       9            42              2.89        0.2735            4.16
                                 3            93              3.57        -0.3077           5.28
                                 3            103             5.39        0.3782            7.98
                                 5            1               3.47        -0.3084           5.06
                                 5            98              4.39        0.3435            6.50
                                 10           91              3.40        0.2953            4.87
Multiplication        qZ1        1            87              6.49        94.2285           6.37
                      qZ4        2            77              4.85        86.4043           4.54
                      qZ5        2            107             2.84        69.1877           2.75
                      qZ6        3            57              13.12       138.6902          13.70
                      qZ9        6            28              2.58        -57.8760          2.40
                      qZ10       7            46              6.92        -97.2126          6.65
                      qZ11       9            40              14.74       147.5234          15.62
                                 5            1               3.24        -67.0759          3.09
                                 5            98              7.46        101.4883          7.33
Division              qZ8        5            69              3.16        -0.0042           5.78


Seven out of the 11 QTLs detected by two component traits are also detected by 
the addition composite trait (table 10.3), including the four major-effect QTLs 
mentioned earlier. Four of the 11 QTLs are not identified by addition, i.e., qZ2, qZ5, 
qZ7, and qZ8 (table 10.3). As for the seven QTLs that are also identified by 
addition, the estimated positions are similar to those from the two components, and the 
additive effects are close to the sum of the effects on component traits (table 10.3). 
Furthermore, two additional QTLs are detected by addition and located on chromosome 5 in the 
repulsive phase, at a distance of 97 cM from each other. For subtraction, most of the 
11 QTLs are not identified. For the four major-effect QTLs, only one (i.e., qZ11) is 
detected at a similar position. However, five additional QTLs are detected and 
located on chromosomes 3, 5, and 10. For multiplication, most identified QTLs are 
similar to those identified by the addition composite trait, including the two QTLs 
which are not identified by two component traits. One major-effect QTL, i.e., qZ3, is 
not detected by the multiplication composite trait. Results for division are the 
worst. Only one QTL, i.e., qZ8, is detected. None of the four major-effect QTLs are 
detected by the division composite trait (table 10.3). 

Composite-only QTLs occur in the maize population as well. Two linked 
QTLs on chromosome 3, and one QTL on chromosome 10 are identified only by 
subtraction. Two QTLs located at 1 and 98 cM on chromosome 5, respectively, are 
identified only by addition, subtraction, and multiplication. None of them are 
identified by either component (table 10.3). However, the composite-only QTLs may 
be explained by the less-significant peaks in LOD score profiles from the component 
traits. For example, there is a peak in the LOD score profile from component trait I 
near 1 cM of chromosome 5, where the LOD score is equal to 2.29, and the additive 
effect is estimated at -0.4096 (see the dashed arrow in figure 10.5). No clear peak 
can be observed in the LOD score profile of component trait II around this position, 
so its effect on component II can be viewed as 0. It is expected that if this QTL can 
be identified by addition, subtraction, and multiplication, it will have negative 
effects on them as well. In fact, the additive effects are estimated at -0.7590, -0.3084, and 
-67.0759 by addition, subtraction, and multiplication, respectively (table 10.3). 
Though the peak in the LOD profile of component trait I does not exceed the LOD score 
threshold of 2.5, this QTL is very likely to be the same as those identified at similar 
positions by addition, subtraction, and multiplication. 


FIG. 10.5 – One-dimensional scanning at step 1 cM on two component traits (i.e., female and 
male flowering days), and four composite traits (i.e., addition, subtraction, multiplication, 
and division of the two component traits) in one RIL population in maize. Notes: For easier 
comparison, LOD profiles from multiplication, subtraction, addition, component trait II, and 
component trait I are shifted upward by 20, 40, 60, 80, and 100, respectively. Eleven QTLs detected 
by two component traits are denoted by qZ1–qZ11. Arrows in component traits point to two 
peaks lower than the LOD score threshold. Arrows in composite traits point to peaks higher 
than the LOD score threshold, but no obvious peaks are observed at similar positions on the 
LOD score profiles from component traits. 

Two linked QTLs on chromosome 3 identified by subtraction could be the same 
as the peak having a LOD score of 2.09 from component trait II (see the dashed line 
in figure 10.5). However, the QTL identified by three composite traits to be located 
at 98 cM on chromosome 5 is an exception (figure 10.5). There are no clear peaks in 
the LOD score profiles from both component traits around position 98 cM on 
chromosome 5. From the power simulation studies given below, we may conclude 
that the composite-only QTLs which cannot be well explained by the mapping 
results from component traits are highly likely to be false positives. It is highly 
unlikely that any QTL affects only the composite traits without having any 
effect on the component traits. 


10.2.3 Genetic Effects and Genetic Variances 
on Composite Traits 


Both theoretical deduction and simulation approaches are adopted to illustrate the 
genetic characteristics of composite traits and potential problems in QTL mapping. 
For more details, readers can refer to Wang et al. (2012). A total of four QTLs are 
considered, i.e., $Q_1$ and $Q_2$ affect component trait I; $Q_3$ and $Q_4$ affect component 
trait II. Their additive effects on two component traits are represented by $a_1$, $a_2$, $a_3$, 
and $a_4$, respectively. No interactions between $Q_1$ and $Q_2$, and between $Q_3$ and $Q_4$ are 
considered at first. Assume the mapping population consists of a number of RILs 
derived from a bi-parental cross. The population has a mean value $m_1$ for component 
trait I, and mean value $m_2$ for component trait II. There are four homozygous 
genotypes for each component trait. Genotypic values on trait I are represented by 
$G_{11}$, $G_{12}$, $G_{13}$, and $G_{14}$; genotypic values on trait II are represented by $G_{21}$, $G_{22}$, $G_{23}$, 
and $G_{24}$. Under the additive genetic model, the relationship between genotypic 
values and additive effects is given in equation 10.7. 


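$$
\begin{aligned}
G_{11} &= m_1 + a_1 + a_2, & G_{12} &= m_1 + a_1 - a_2, & G_{13} &= m_1 - a_1 + a_2, & G_{14} &= m_1 - a_1 - a_2, \\
G_{21} &= m_2 + a_3 + a_4, & G_{22} &= m_2 + a_3 - a_4, & G_{23} &= m_2 - a_3 + a_4, & G_{24} &= m_2 - a_3 - a_4.
\end{aligned}
\tag{10.7}
$$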


When the four QTLs are considered jointly, sixteen homozygous genotypes 
occur in the RIL population. Genotypic values on composite traits can be calcu- 
lated from the genotypic values on two component traits (table 10.4), from which 
the population mean and genetic variance of each composite trait can be calcu- 
lated as well. 


TAB. 10.4 - Genotypes and genotypic values of component and composite traits in a four-QTL model, two affecting each component trait. 


No.  Genotype                  Genotypic value
     Q1    Q2    Q3    Q4      Trait I  Trait II  Addition   Subtraction  Multiplication  Division
1    Q1Q1  Q2Q2  Q3Q3  Q4Q4    G11      G21       G11 + G21  G11 - G21    G11 × G21       G11 / G21
2    Q1Q1  Q2Q2  Q3Q3  q4q4    G11      G22       G11 + G22  G11 - G22    G11 × G22       G11 / G22
3    Q1Q1  Q2Q2  q3q3  Q4Q4    G11      G23       G11 + G23  G11 - G23    G11 × G23       G11 / G23
4    Q1Q1  Q2Q2  q3q3  q4q4    G11      G24       G11 + G24  G11 - G24    G11 × G24       G11 / G24
5    Q1Q1  q2q2  Q3Q3  Q4Q4    G12      G21       G12 + G21  G12 - G21    G12 × G21       G12 / G21
6    Q1Q1  q2q2  Q3Q3  q4q4    G12      G22       G12 + G22  G12 - G22    G12 × G22       G12 / G22
7    Q1Q1  q2q2  q3q3  Q4Q4    G12      G23       G12 + G23  G12 - G23    G12 × G23       G12 / G23
8    Q1Q1  q2q2  q3q3  q4q4    G12      G24       G12 + G24  G12 - G24    G12 × G24       G12 / G24
9    q1q1  Q2Q2  Q3Q3  Q4Q4    G13      G21       G13 + G21  G13 - G21    G13 × G21       G13 / G21
10   q1q1  Q2Q2  Q3Q3  q4q4    G13      G22       G13 + G22  G13 - G22    G13 × G22       G13 / G22
11   q1q1  Q2Q2  q3q3  Q4Q4    G13      G23       G13 + G23  G13 - G23    G13 × G23       G13 / G23
12   q1q1  Q2Q2  q3q3  q4q4    G13      G24       G13 + G24  G13 - G24    G13 × G24       G13 / G24
13   q1q1  q2q2  Q3Q3  Q4Q4    G14      G21       G14 + G21  G14 - G21    G14 × G21       G14 / G21
14   q1q1  q2q2  Q3Q3  q4q4    G14      G22       G14 + G22  G14 - G22    G14 × G22       G14 / G22
15   q1q1  q2q2  q3q3  Q4Q4    G14      G23       G14 + G23  G14 - G23    G14 × G23       G14 / G23
16   q1q1  q2q2  q3q3  q4q4    G14      G24       G14 + G24  G14 - G24    G14 × G24       G14 / G24

On the other hand, from the 16 genotypic values given in table 10.4 for each 
composite trait, one overall mean (denoted by M) on the trait and 15 genetic effects 
can be calculated by equation 10.8, where the 16 genotypic values are represented by 
G1 to G16, respectively; Ai (i = 1, 2, 3, and 4) denotes the additive effects of the four 
QTLs; Aij denotes the additive by additive epistatic effects between two QTLs (i, 
j = 1, 2, 3, and 4, and i ≠ j); Aijk denotes the additive by additive epistatic effects 
among three QTLs (i, j, k = 1, 2, 3, and 4, and i ≠ j, i ≠ k, j ≠ k); and A1234 denotes 
the epistatic effects among the four QTLs. 


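$$
\left[\begin{array}{c}
G_1 \\ G_2 \\ G_3 \\ G_4 \\ G_5 \\ G_6 \\ G_7 \\ G_8 \\ G_9 \\ G_{10} \\ G_{11} \\ G_{12} \\ G_{13} \\ G_{14} \\ G_{15} \\ G_{16}
\end{array}\right]
=
\left[\begin{array}{rrrrrrrrrrrrrrrr}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & -1 & 1 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & -1 & -1 & -1 \\
1 & 1 & 1 & -1 & 1 & 1 & -1 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & -1 & -1 \\
1 & 1 & 1 & -1 & -1 & 1 & -1 & -1 & -1 & -1 & 1 & -1 & -1 & 1 & 1 & 1 \\
1 & 1 & -1 & 1 & 1 & -1 & 1 & 1 & -1 & -1 & 1 & -1 & -1 & 1 & -1 & -1 \\
1 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 & 1 \\
1 & 1 & -1 & -1 & 1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & -1 & -1 & 1 & 1 \\
1 & 1 & -1 & -1 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 \\
1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & 1 & -1 \\
1 & -1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & -1 & -1 & -1 & 1 & 1 & -1 & 1 \\
1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 \\
1 & -1 & 1 & -1 & -1 & -1 & 1 & 1 & -1 & -1 & 1 & 1 & 1 & -1 & 1 & -1 \\
1 & -1 & -1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 & -1 & 1 \\
1 & -1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & -1 & 1 & 1 & -1 \\
1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 & 1 & -1 & -1 & -1 & 1 & 1 & 1 & -1 \\
1 & -1 & -1 & -1 & -1 & 1 & 1 & 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 & 1
\end{array}\right]
\left[\begin{array}{c}
M \\ A_1 \\ A_2 \\ A_3 \\ A_4 \\ A_{12} \\ A_{13} \\ A_{14} \\ A_{23} \\ A_{24} \\ A_{34} \\ A_{123} \\ A_{124} \\ A_{134} \\ A_{234} \\ A_{1234}
\end{array}\right]
\tag{10.8}
$$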


The theoretical genetic effects of each composite trait can be calculated by 
equation 10.8. To further demonstrate the theoretical genetic effects and genetic 
variance of composite traits, tables 10.5 and 10.6 give three distribution models and 
three effect models on the four QTLs, respectively. The two tables also provide the 
information required for simulating the mapping populations. Assume there are ten 
chromosomes in a genome, each 150 cM in length and carrying 16 evenly distributed 
markers. In distribution A, the four QTLs are located on chromosomes 1-4, and 
their chromosomal positions are 18, 28, 53, and 63 cM, respectively. A linkage distance 
of 35 cM is considered in distributions B and C on the first two chromosomes. 
In distribution B, Q1 and Q2, both affecting component trait I, are linked on 
chromosome 1; Q3 and Q4, both affecting component trait II, are linked on chromosome 2. 
In distribution C, Q1 and Q3 are linked on chromosome 1; Q2 and Q4 are linked on 
chromosome 2 (table 10.5). 


TAB. 10.5 - Three distribution models (i.e., A, B, and C) for four QTLs, two affecting each of 
the two component traits. 


QTL  Trait affected  Distribution A      Distribution B      Distribution C
                     Chrom.  Pos. (cM)   Chrom.  Pos. (cM)   Chrom.  Pos. (cM)
Q1   Component I     1       18.0        1       18.0        1       18.0
Q2   Component I     2       28.0        1       53.0        2       28.0
Q3   Component II    3       53.0        2       28.0        1       53.0
Q4   Component II    4       63.0        2       63.0        2       63.0


For each distribution model, three genetic effect models are assumed (table 10.6). 
Effect A is a pure additive model, where the component QTLs only have additive 
effects. Effect B is an additive and epistasis model, where the component QTLs 
have both additive and epistatic effects. Effect C is a pure epistasis model, where the 
component QTLs have epistatic effects but do not have any additive effects. In effect 
A, the additive effects of the four QTLs are all set at 1.0. No epistatic effects are 
present between Q1 and Q2, or between Q3 and Q4. In effect B, additive effects of 
the four QTLs, and additive by additive epistatic effects between Q1 and Q2, and 
between Q3 and Q4, are all set at 1.0. In effect C, additive by additive epistatic effects 
between Q1 and Q2, and between Q3 and Q4, are both set at 1.0. None of the four 
QTLs has any additive effect. The mean value is set at 25 for trait I, and 20 for 
trait II, for the three effect models. 


TAB. 10.6 - Three genetic effect models (i.e., A, B, and C) for the four QTLs defined in 
table 10.5, two affecting each of the two component traits. 


Genetic effects                                    Effect A  Effect B  Effect C
Additive effect of Q1 on trait I (a1)              1         1         0
Additive effect of Q2 on trait I (a2)              1         1         0
Additive effect of Q3 on trait II (a3)             1         1         0
Additive effect of Q4 on trait II (a4)             1         1         0
Epistasis between Q1 and Q2 on trait I (aa12)      0         1         1
Epistasis between Q3 and Q4 on trait II (aa34)     0         1         1

Table 10.7 gives the genetic effects, genetic variances, and heritabilities 
of component and composite traits under the QTL effect model A (defined in 
table 10.6). As expected, for component trait I, M = m1, A1 = a1, A2 = a2, and other 
genetic effects are equal to 0. For component trait II, M = m2, A3 = a3, A4 = a4, and 
other genetic effects are equal to 0. For composite trait addition, M = m1 + m2, 
A1 = a1, A2 = a2, A3 = a3, A4 = a4, and other genetic effects are equal to 0. For 
subtraction, M = m1 - m2, A1 = a1, A2 = a2, A3 = -a3, A4 = -a4, and other 
genetic effects are equal to 0. Interestingly, there are di-genic epistatic effects for 
multiplication, and di-genic and tri-genic epistatic effects for division. Obviously, 
more QTLs are involved in composite traits addition and subtraction than in 
the component traits, but the types and meanings of genetic effects remain 
unchanged. For composite traits multiplication and division, in addition to the 
increased QTL number, other types of genetic effects are present as well. It can be 
imagined that for more complex genetic models, such as effect models B and C 
defined in table 10.6, higher-order epistatic effects would be present as well. 
Therefore, genetic architectures associated with multiplication and division become 
much more complicated, due to the presence of high-order epistatic effects. 
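
The statements above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it inverts equation 10.8 for effect model A with four unlinked QTLs (means 25 and 20, all additive effects 1.0), using the fact that the columns of the ±1 design matrix are mutually orthogonal, so each effect is the average of the genotypic values weighted by the product of the corresponding sign codes. The names `OPERATIONS`, `genotypic_values`, and `genetic_effects` are illustrative only.

```python
from itertools import combinations, product
from math import prod

# Effect model A from table 10.6 (values taken from the text).
M1, M2 = 25.0, 20.0            # means of component traits I and II
A = (1.0, 1.0, 1.0, 1.0)       # additive effects a1..a4

OPERATIONS = {
    "addition": lambda g1, g2: g1 + g2,
    "subtraction": lambda g1, g2: g1 - g2,
    "multiplication": lambda g1, g2: g1 * g2,
    "division": lambda g1, g2: g1 / g2,
}


def genotypic_values(op):
    """Map each of the 16 sign codes (+1 = QQ, -1 = qq) to a composite value."""
    values = {}
    for s in product((1, -1), repeat=4):        # same row order as table 10.4
        g1 = M1 + s[0] * A[0] + s[1] * A[1]     # equation 10.7, trait I
        g2 = M2 + s[2] * A[2] + s[3] * A[3]     # equation 10.7, trait II
        values[s] = op(g1, g2)
    return values


def genetic_effects(values):
    """Invert equation 10.8: effect = mean of (sign product x genotypic value).

    Keys are tuples of 1-based QTL indices; the empty tuple holds the mean M.
    """
    effects = {}
    for order in range(5):
        for subset in combinations(range(4), order):
            contrast = sum(g * prod(s[i] for i in subset) for s, g in values.items())
            effects[tuple(i + 1 for i in subset)] = contrast / len(values)
    return effects


if __name__ == "__main__":
    for name, op in OPERATIONS.items():
        effects = genetic_effects(genotypic_values(op))
        nonzero = {k: round(v, 4) for k, v in effects.items() if abs(v) > 1e-9}
        print(name, nonzero)
```

Under these assumptions, addition and subtraction yield only a mean and four additive effects, multiplication adds the four di-genic terms A13, A14, A23, and A24, and division produces both di-genic and tri-genic terms, which agrees with the description of table 10.7.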

In addition to the different sizes and types of genetic effects between component and 
composite traits, it can be seen from table 10.7 that genetic variances and heritabilities 
differ considerably as well. When epistasis is not considered, genetic variance 
from multiple QTLs with linkage in the RIL population is given by equation 10.9, 

$$
V_G = \sum_{i=1}^{q} a_i^2 + 2 \sum_{i<j} (1 - 2R_{ij})\, a_i a_j \tag{10.9}
$$

where q is the number of additive QTLs; ai and aj are the additive effects of the ith 
and jth QTLs; and Rij is the accumulated recombination frequency between the ith and 
jth QTLs during the repeated selfing. In table 10.7, genetic variances on the two 
component traits, and on composite traits addition and subtraction, are calculated by 
equation 10.9. The theoretical formulas for genetic variances on composite traits 
multiplication and division are much more complicated due to the presence of 
epistasis; the corresponding genetic variances given in table 10.7 are calculated from a large simulated 
population. By definition, the broad-sense heritability (H²) is the total genetic 
variance (VG) divided by the total phenotypic variance (VP), i.e., equation 10.10. 

$$
H^2 = \frac{V_G}{V_P} \tag{10.10}
$$


TAB. 10.7 - Genetic effects, genetic variances, and heritabilities of component and composite traits under QTL effect model A as defined in table 10.6. 


Assume the error variance (Vε) is equal to 4.67 on both component traits. 
In QTL distribution models A and C, heritabilities are equal to 0.30 for both 
component traits. In QTL distribution model B, the two heritabilities are equal to 
0.39. Error variances on composite traits addition and subtraction are equal to the 
sum of error variances on the two component traits. Therefore, their heritabilities can 
also be directly calculated, and are given in table 10.7 as well. Error variances on 
composite traits multiplication and division cannot be easily acquired, so their 
heritabilities are not given in table 10.7. It can be seen from table 10.7 that the genetic 
complexity in composite traits arises from three aspects: (1) the number of QTLs 
involved, (2) the higher order of gene interactions, and (3) the linkage relationship 
between QTLs. The increased number of QTLs, and the presence of epistatic effects 
and genetic linkage, will reduce the efficiency of QTL mapping due to the decreased 
proportion of the phenotypic variance explained by individual QTLs and the 
increased difficulty in controlling the background genetic variation. 
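
As a rough numerical companion to equations 10.9 and 10.10, the sketch below computes genetic variances and heritabilities for unit additive effects and an error variance of 4.67 per component trait. Two choices are assumptions that the text does not state: the Haldane mapping function for converting 35 cM into a one-meiosis recombination frequency r, and the usual selfing relation R = 2r / (1 + 2r) for the accumulated recombination frequency in RILs. All function names are illustrative.

```python
from math import exp

ERROR_VARIANCE = 4.67   # per component trait, from the text


def haldane_r(distance_cm):
    """One-meiosis recombination frequency (assumes the Haldane mapping function)."""
    return 0.5 * (1.0 - exp(-2.0 * distance_cm / 100.0))


def ril_accumulated_r(r):
    """Accumulated recombination frequency in selfed RILs: R = 2r / (1 + 2r)."""
    return 2.0 * r / (1.0 + 2.0 * r)


def genetic_variance(effects, big_r):
    """V_G = sum a_i^2 + 2 * sum_{i<j} (1 - 2 R_ij) a_i a_j (no epistasis).

    big_r[(i, j)] holds R_ij for linked pairs; missing pairs are unlinked (R = 0.5).
    """
    q = len(effects)
    vg = sum(a * a for a in effects)
    for i in range(q):
        for j in range(i + 1, q):
            vg += 2.0 * (1.0 - 2.0 * big_r.get((i, j), 0.5)) * effects[i] * effects[j]
    return vg


def heritability(vg, ve):
    """Broad-sense heritability with V_P = V_G + V_e."""
    return vg / (vg + ve)


R35 = ril_accumulated_r(haldane_r(35.0))

# Component trait I: Q1 and Q2 unlinked (distributions A and C) or 35 cM apart (B).
h2_unlinked = heritability(genetic_variance([1.0, 1.0], {}), ERROR_VARIANCE)
h2_linked = heritability(genetic_variance([1.0, 1.0], {(0, 1): R35}), ERROR_VARIANCE)

# Distribution C: Q1-Q3 and Q2-Q4 are linked; subtraction flips the sign of a3 and a4.
links_c = {(0, 2): R35, (1, 3): R35}
h2_add_c = heritability(genetic_variance([1, 1, 1, 1], links_c), 2 * ERROR_VARIANCE)
h2_sub_c = heritability(genetic_variance([1, 1, -1, -1], links_c), 2 * ERROR_VARIANCE)

print(round(h2_unlinked, 3), round(h2_linked, 3), round(h2_add_c, 3), round(h2_sub_c, 3))
```

With these assumptions the unlinked case gives a heritability of about 0.30, the linked component traits give about 0.36, and under distribution C addition gives about 0.36 while subtraction drops to about 0.22. The exact figures depend on the mapping function and on whether the one-generation or the accumulated recombination frequency is used.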


10.2.4 Power Analysis in QTL Mapping on Composite 
Traits 


Using the QTL IciMapping software, 1000 populations, each consisting of 
200 RILs, were generated for each QTL distribution and effect model defined in 
tables 10.5 and 10.6. Phenotypic values of component traits were calculated from the 
corresponding genetic models, from which the composite traits were calculated by 
direct operations. QTL mapping in each simulated population was conducted by 
ICIM as implemented in the software. The probabilities for a marker to enter and 
to leave the model were set at 0.01 and 0.02, respectively. The LOD score 
threshold of 2.5 was used to declare significant QTLs. Two methods to calculate 
the QTL detection power have been described in chapter 5. 

In comparison with component traits, QTL detection power is reduced and the false 
discovery rate (FDR) is increased with composite traits for each distribution and 
effect model. Results for the three distribution models under effect model A are 
shown in table 10.8. In effect model A, the four QTLs have an equal additive effect. 
It can be seen that the detection powers in the 10 cM support interval are consistently 
high for component traits under the three QTL distribution models, i.e., 
91.90%-95.40% (table 10.8). Detection powers are slightly lower for Q2 and Q4 in 
distribution model B, due to the linkage between Q1 and Q2, and between Q3 and 
Q4. FDR is around 22%, and the estimates of position and effect are close to their 
true values regardless of the distribution model. 

Much lower detection power and higher FDR are observed for the composite traits 
under the three distribution models (table 10.8). Addition and subtraction have 
similar detection powers in distribution models A and B, about 25 percentage points lower 
than the detection power from component traits. In distribution model C, the 
detection power from subtraction is reduced by 30%-50% due to the repulsive linkage 
between Q1 and Q3, and between Q2 and Q4, on the subtraction composite trait. For the 
three distribution models, the detection power from multiplication is lower than that 
from addition; the detection power from division is lower than that from 
subtraction. 

TAB. 10.8 - Detection power and false discovery rate (FDR) by using the component and composite traits in QTL mapping for the three QTL distribution models (defined in table 10.5) under effect model A (defined in table 10.6). 


Model                   Parameter          QTL  Trait I  Trait II  Addition  Subtraction  Multiplication  Division
Distribution model A    Power (%)          Q1   95.10              69.60     69.30        55.20           50.50
(table 10.5)                               Q2   94.80              69.80     70.40        54.10           50.90
                                           Q3            92.50     67.20     65.30        76.90           75.20
                                           Q4            94.50     68.40     65.40        77.80           75.20
                        FDR (%)                 21.63    22.98     27.42     28.05        28.07           29.68
                        Position (cM)      Q1   18.54              18.55     18.62        18.36           18.45
                                           Q2   28.46              28.49     28.38        28.44           28.52
                                           Q3            52.65     52.68     52.61        52.75           52.65
                                           Q4            62.85     62.83     62.63        62.88           62.58
                        Additive effect    Q1   1.00               1.10      1.11         23.32           0.06
                                           Q2   1.01               1.09      1.11         23.42           0.06
                                           Q3            1.00      1.11      -1.11        26.46           -0.07
                                           Q4            1.00      1.10      -1.12        26.61           -0.07

Distribution model B    Power (%)          Q1   95.40              67.40     65.60        54.80           49.90
(table 10.5)                               Q2   92.90              62.40     66.00        50.00           49.90
                                           Q3            93.70     69.90     67.00        79.20           74.90
                                           Q4            91.90     62.40     64.90        73.50           72.90
                        FDR (%)                 21.35    22.18     28.76     28.59        28.07           28.89
                        Position (cM)      Q1   18.46              18.43     18.66        18.51           18.73
                                           Q2   52.80              52.63     52.43        52.48           52.39
                                           Q3            28.49     28.52     28.64        28.60           28.70
                                           Q4            62.86     62.75     62.46        62.79           62.52
                        Additive effect    Q1   1.01               1.16      1.15         25.40           0.07
                                           Q2   1.01               1.16      1.16         25.12           0.07
                                           Q3            1.03      1.15      -1.16        27.47           -0.07
                                           Q4            1.00      1.12      -1.14        26.61           -0.07

Distribution model C    Power (%)          Q1   95.20              66.60     52.40        53.60           37.70
(table 10.5)                               Q2   95.00              69.20     51.60        54.70           36.40
                                           Q3            92.90     63.40     47.80        69.70           56.20
                                           Q4            92.60     61.50     49.90        72.60           58.00
                        FDR (%)                 19.78    23.44     28.83     27.71        29.74           30.18
                        Position (cM)      Q1   18.51              18.45     18.47        18.50           18.40
                                           Q2   28.45              28.55     28.44        28.61           28.56
                                           Q3            52.83     52.62     52.66        52.60           52.65
                                           Q4            62.82     62.69     62.75        62.71           62.83
                        Additive effect    Q1   1.00               1.16      1.12         24.76           0.06
                                           Q2   1.01               1.16      1.12         24.88           0.06
                                           Q3            0.99      1.16      -1.12        27.88           -0.07
                                           Q4            0.99      1.12      -1.11        27.17           -0.07


The reduction in detection power when using composite traits can be explained 
by the larger QTL number and by the fact that more complicated genetic effects are 
associated with the composite traits. For effect model A, only additive effects of two 
QTLs are involved in each component trait. However, four QTLs affect each 
composite trait. Genetic effects other than additive effects are present for composite traits 
multiplication and division, which makes it complicated to control the background 
variation during the one-dimensional scanning of additive QTLs. The increased 
QTL number and more complicated genetic effects reduce the PVE of individual 
QTLs and, consequently, the QTL detection power. For example, in distribution 
model A, Q1 explains 50% of the genotypic variance for component trait I. When 
addition or subtraction is used, Q1 explains 25% of the genotypic variance (calculated 
from table 10.7). When multiplication and division are used, Q1 only explains 
19.46% of the genotypic variance (calculated from table 10.7). As mentioned in 
§10.1, PVE is one major determining factor in detection power. The reduced PVE of 
individual QTLs in composite traits is the reason for the reduced detection power 
observed in table 10.8. 
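
The shrinking share of a single QTL can be illustrated with a short sketch. It assumes effect model A with four unlinked QTLs, so that the 16 homozygous genotypes of table 10.4 are equally frequent in the RIL population, and it measures the contribution of Q1 by its squared additive effect divided by the total genotypic variance. The function name `share_of_q1` is illustrative, and the values are theoretical rather than the simulated figures behind table 10.7.

```python
from itertools import product
from statistics import fmean, pvariance

M1, M2 = 25.0, 20.0   # trait means under effect model A (from the text)


def share_of_q1(composite):
    """Fraction of the genotypic variance due to the additive effect of Q1.

    Assumes four unlinked QTLs with unit additive effects, so the 16 homozygous
    genotypes are equally frequent and the variance due to Q1 is its squared effect.
    """
    signs = list(product((1, -1), repeat=4))
    values = [composite(M1 + s[0] + s[1], M2 + s[2] + s[3]) for s in signs]
    a1 = fmean([s[0] * g for s, g in zip(signs, values)])   # additive effect of Q1
    return a1 ** 2 / pvariance(values)


shares = {
    "trait I": share_of_q1(lambda g1, g2: g1),
    "addition": share_of_q1(lambda g1, g2: g1 + g2),
    "subtraction": share_of_q1(lambda g1, g2: g1 - g2),
    "multiplication": share_of_q1(lambda g1, g2: g1 * g2),
    "division": share_of_q1(lambda g1, g2: g1 / g2),
}

for name, share in shares.items():
    print(f"{name}: {100 * share:.2f}%")
```

Under these assumptions the share falls from 50% for component trait I to 25% for addition and subtraction, and to a little under 20% for multiplication and division, in line with the values quoted above.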

The detection power given in table 10.8 is based on a support interval of 10 cM in 
length, where the true QTL is located at the center of the interval. One pre-defined 
QTL is counted as correctly mapped when a QTL is detected within its support 
interval in the simulated population. QTLs located outside the support intervals of 
the four pre-defined QTLs are counted as false positives. The distribution of false 
positives on the genome cannot be seen from the power analysis based on support 
intervals but can be seen from the power analysis based on marker intervals. 
Figure 10.6 shows the detection power counted by marker intervals on 10 chromosomes 
for QTL effect model A (table 10.6) and distribution model B (table 10.5). 
Other models show a similar trend. Four clear peaks can be observed around the four 
pre-defined QTLs on component and composite traits (figure 10.6). Powers are close 
to 0 in other chromosomal regions. That is, powers on marker intervals across the whole genome 
do not show any other chromosomal regions with a notable occurrence of QTLs when 
the four composite traits are used in QTL mapping. In addition, results from effect 
model C indicate that epistasis between QTLs controlling the component 
traits does not give rise to composite-only QTLs either (Wang et al., 2012). 
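
The counting rule based on support intervals can be sketched as follows. This is an illustration rather than the procedure coded in QTL IciMapping: the function name, the data layout, and the toy detections are made up, and FDR is taken here as the number of false positives divided by the number of all detected QTLs, which is an assumption about the definition given in chapter 5.

```python
def power_and_fdr(true_qtls, runs, half_width=5.0):
    """Count detection power per true QTL and the overall FDR.

    true_qtls: list of (chromosome, position_cM) for the pre-defined QTLs.
    runs: one list of detected (chromosome, position_cM) per simulated population.
    half_width: half the support interval (5 cM gives the 10 cM interval in the text).
    """
    hits = [0] * len(true_qtls)
    detected = 0
    false_positives = 0
    for run in runs:
        found = [False] * len(true_qtls)
        for chrom, pos in run:
            detected += 1
            inside = False
            for k, (true_chrom, true_pos) in enumerate(true_qtls):
                if chrom == true_chrom and abs(pos - true_pos) <= half_width:
                    found[k] = True
                    inside = True
            if not inside:
                false_positives += 1
        for k, was_found in enumerate(found):
            hits[k] += was_found          # each true QTL counts once per population
    power = [h / len(runs) for h in hits]
    fdr = false_positives / detected if detected else 0.0
    return power, fdr


# Toy data (made up for illustration): two true QTLs, three simulated populations.
TRUE_QTLS = [(1, 18.0), (2, 28.0)]
RUNS = [
    [(1, 17.2), (2, 30.1)],   # both QTLs recovered
    [(1, 25.0)],              # 7 cM away from the first QTL: a false positive
    [(2, 27.0), (3, 50.0)],   # second QTL recovered, plus a false positive
]
power, fdr = power_and_fdr(TRUE_QTLS, RUNS)
print(power, fdr)
```

In the toy data the first QTL is recovered in one of three populations and the second in two of three, while two of the five detected QTLs fall outside every support interval.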


FIG. 10.6 - Detection power (%) counted by marker intervals on 10 chromosomes for QTL 
effect model A (table 10.6) and distribution model B (table 10.5). Panels, from top to bottom: 
component trait I, component trait II, addition, subtraction, multiplication, and division. 
Horizontal axis: marker intervals (each dot represents one interval, and the number represents 
the chromosome). Vertical axis: power by marker interval (%). Notes: For convenience, 
powers from multiplication, subtraction, addition, component trait II, and component trait I 
were increased by 100, 200, 300, 400, and 500, respectively. 


Table 10.9 gives the simulation results based on the observed distributions and 
effects of eleven QTLs identified in the maize RIL population (table 10.3). When 
trait I was used in QTL mapping, detection powers of the four major-effect QTLs, 
i.e., qZ1, qZ3, qZ6, and qZ11, are higher than 80% due to their large genetic effects. 
The detection power of qZ4 is lower than that of qZ3 (table 10.9), since qZ4 is 
linked to qZ3 on chromosome 2 and the additive effect of qZ4 is smaller (table 10.3). 
As expected, qZ7 has the smallest effect among the seven QTLs for trait I 
(table 10.3), and thus has the lowest detection power (table 10.9). Similarly, qZ4, 
qZ5, qZ10, and qZ11 are the top four QTLs controlling trait II, and their detection 
powers are higher than 85% (table 10.9). qZ2 is the smallest among the seven QTLs 
controlling trait II (table 10.3), and its detection power is the lowest, i.e., 53.5% 
(table 10.9). In addition, FDR from the two component traits is around 23.0%. 

When composite traits are used, those QTLs detected by both component and 
composite traits have comparable powers; however, those detected by component 
traits but not by composite traits have low detection powers. For example, qZ2, qZ5, 
qZ7, and qZ8 were not detected by addition (table 10.3), and their detection powers 
are 12.3, 47.2, 25.8, and 19.7%, respectively (table 10.9), which are much lower than 
those from traits I and II. qZ8 and qZ9 were detected by trait II with PVE at 2.93 
and 2.88%, respectively. They have medium-high detection powers when trait II is 
used in simulated populations, i.e., 70.0 and 64.0%, respectively, but have low 
detection powers when composite traits are used, i.e., 13.7%-21.7%. 

FDRs from subtraction and division are much higher than those from the two 
component traits, and the other two composite traits (table 10.9), which may 
explain the fact that some additional QTLs were identified by subtraction in the 
maize RIL population. QTL positions are nearly unbiased for both component and 
composite traits. But the effects are over-estimated for component traits and 
under-estimated for composite traits (tables 10.3 and 10.9). 


10.2.5 Heritability of Composite Traits 


As previously indicated, error variances associated with composite traits such as 
multiplication and division are difficult to acquire from the error variances of 
component traits in theory. Table 10.7 only gives the heritabilities on addition and 
subtraction. However, given the QTL distribution and effect model, genotypic values 
on component traits are known for any individuals or lines included in the simulated 
population. Genotypic values on composite traits can be acquired from addition, 
subtraction, multiplication, and division of the genotypic values from the compo- 
nent traits; phenotypic values on composite traits can be acquired similarly from the 
phenotypic values on component traits. Therefore, error effects associated with 
composite traits can be calculated by the difference between phenotypic and geno- 
typic values, based on which the error variances and heritabilities can be calculated 
for composite traits as well. Table 10.10 gives the broad-sense heritabilities for both 
component and composite traits under the three distribution and three effect models 
as defined in tables 10.5 and 10.6. The last row in the table gives the heritabilities in the 
maize RIL population, which are calculated from the one-way ANOVA on the 
replicated phenotypic values. 
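
The procedure described above can be sketched for the simplest case. The code assumes effect model A with four unlinked QTLs (distribution model A), normally distributed errors with variance 4.67 on each component trait, and heritability computed as the genetic variance divided by the sum of genetic and error variances. The function names, the number of simulated lines, and the seed are arbitrary choices made for illustration.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def simulate_heritabilities(n_lines=50_000, error_variance=4.67, seed=1):
    """Broad-sense heritability of component and composite traits by simulation."""
    rng = random.Random(seed)
    sd = error_variance ** 0.5
    ops = {
        "trait I": lambda x, y: x,
        "trait II": lambda x, y: y,
        "addition": lambda x, y: x + y,
        "subtraction": lambda x, y: x - y,
        "multiplication": lambda x, y: x * y,
        "division": lambda x, y: x / y,
    }
    genotypic = {name: [] for name in ops}
    errors = {name: [] for name in ops}
    for _ in range(n_lines):
        s = [rng.choice((1, -1)) for _ in range(4)]   # RIL genotype at Q1..Q4
        g1 = 25.0 + s[0] + s[1]                       # genotypic value, trait I
        g2 = 20.0 + s[2] + s[3]                       # genotypic value, trait II
        p1 = g1 + rng.gauss(0.0, sd)                  # phenotypic value, trait I
        p2 = g2 + rng.gauss(0.0, sd)                  # phenotypic value, trait II
        for name, op in ops.items():
            g = op(g1, g2)
            genotypic[name].append(g)
            errors[name].append(op(p1, p2) - g)       # error = phenotype - genotype
    result = {}
    for name in ops:
        vg = variance(genotypic[name])
        ve = variance(errors[name])
        result[name] = vg / (vg + ve)
    return result


h2 = simulate_heritabilities()
for name, value in h2.items():
    print(f"{name}: {value:.3f}")
```

With these settings the estimates for the component traits, addition, subtraction, and multiplication come out near 0.30, and the estimate for division is somewhat lower, which is the pattern shown in the first row of table 10.10.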

TAB. 10.9 - Simulation results of the detection power and FDR, based on the observed QTL distributions and effects in the maize RIL 
population. 


Parameter                  QTL   Trait I  Trait II  Addition  Subtraction  Multiplication  Division
Power (%)                  qZ1   88.0               75.0      70.8         74.0            68.7
                           qZ2            53.5      12.3      13.1         13.6            13.5
                           qZ3   99.4               97.0      98.4         96.7            98.3
                           qZ4   65.0     87.6      85.8      3.8          86.1            3.7
                           qZ5            93.1      47.2      47.1         50.3            50.3
                           qZ6   81.2               78.8      78.4         79.0            77.4
                           qZ7   43.1               25.8      27.9         26.0            26.4
                           qZ8            70.0      19.7      20.8         21.7            23.1
                           qZ9            64.0      14.5      13.7         15.9            15.5
                           qZ10  65.1     86.3      88.5      1.5          88.4            1.5
                           qZ11  95.2     98.7      99.6      5.5          99.7            4.4
FDR (%)                          23.40    22.64     15.18     32.70        15.80           32.76
Estimated position (cM)    qZ1   87.07              87.22     86.82        87.22           86.77
                           qZ2            128.66    128.67    128.72       128.72          128.8
                           qZ3   52.04              52.01     51.89        52.01           51.87
                           qZ4   76.59    76.87     76.77     76.71        76.72           76.97
                           qZ5            107.57    107.5     107.42       107.48          107.46
                           qZ6   56.88              56.88     56.86        56.89           56.88
                           qZ7   60.41              60.35     60.26        60.33           60.27
                           qZ8            69.84     69.74     69.76        69.67           69.72
                           qZ9            27.31     27.34     27.29        27.34           27.11
                           qZ10  47.32    47.49     47.42     48.67        47.35           48.6
                           qZ11  40.08    40.02     39.98     41.65        39.97           41.8
Estimated additive effect  qZ1   0.7893             0.8844    0.8728       70.4176         0.0111
                           qZ2            0.4326    0.7458    -0.7074      59.256          -0.0088
                           qZ3   1.2259             1.2377    1.2309       97.5317         0.0156
                           qZ4   0.7549   0.7696    1.4499    -0.4973      116.1425        -0.0069
                           qZ5            0.5746    0.7910    -0.7542      63.1992         -0.0097
                           qZ6   1.0094             1.0600    1.0339       83.0851         0.0131
                           qZ7   -0.6161            -0.7341   -0.7519      -60.0649        -0.0098
                           qZ8            0.4705    0.7193    -0.7102      57.1931         -0.0091
                           qZ9            -0.4343   -0.6845   0.6887       -54.3701        0.0088
                           qZ10  -0.6859  -0.4951   -1.0589   -0.7348      -84.9817        -0.0040
                           qZ11  1.0011   0.7313    1.6664    0.6296       133.6006        0.0074


TAB. 10.10 - Heritabilities of component and composite traits estimated in simulated RIL populations under the three distribution and three 
effect models as defined in tables 10.5 and 10.6. 


Effect model                Distribution model  Trait I  Trait II  Addition  Subtraction  Multiplication  Division
A: pure additive            A                   0.302    0.301     0.301     0.301        0.300           0.278
                            B                   0.366    0.364     0.367     0.362        0.365           0.332
                            C                   0.303    0.302     0.367     0.224        0.364           0.208
B: additive and epistasis   A                   0.394    0.392     0.394     0.391        0.392           0.334
                            B                   0.435    0.433     0.436     0.431        0.435           0.378
                            C                   0.395    0.393     0.451     0.323        0.457           0.276
C: pure epistasis           A                   0.178    0.177     0.178     0.176        0.177           0.162
                            B                   0.161    0.161     0.162     0.160        0.161           0.154
                            C                   0.178    0.177     0.194     0.160        0.193           0.148
The maize RIL population                        0.597    0.600     0.698     0.397        0.699           0.392

Notes: heritabilities given in the last row for the maize RIL population are calculated from ANOVA on both component and composite traits. 


For effect model A, the heritabilities of the four composite traits are equal to or 
lower than those of the component traits under distribution models A and B. Under 
distribution model C, the heritabilities of addition and multiplication are higher 
than those of subtraction and division. Linkage in coupling, as present in 
addition and multiplication, increases the genetic variance (equation 10.9) and 
therefore increases the heritability as well. On the contrary, linkage in repulsion, as 
present in subtraction and division, decreases the genetic variance (equation 10.9), 
and therefore decreases the heritability. The two component traits are positively 
correlated in the maize population. No matter whether the positive correlation is 
caused by coupling linkage or pleiotropy, genetic variances will be reduced for 
composite traits subtraction and division. Therefore, the heritabilities of these two 
composite traits are reduced as well (table 10.10). 

Even if the composite traits have heritabilities similar to those of the component traits, more 
QTLs are involved in the genetic architecture of composite traits, with more complicated 
linkage relationships and genetic effects. It is expected that the use of 
composite traits will result in reduced power and increased false positives, as 
observed in the simulation studies in the previous section (table 10.8). 
For the three distribution and three effect models, the heritability of addition is 
equal to or higher than that of subtraction (table 10.10), and the detection power from addition 
is also higher (table 10.8); the heritability of multiplication is higher than that of 
division (table 10.10), and the detection power from multiplication is also higher 
(table 10.8). In the maize RIL population, heritabilities are close to 0.6 for both 
component traits (table 10.10). Though the heritabilities are close to 0.7 for addition and 
multiplication, two QTLs detected by component traits with relatively smaller 
effects were not detected by either addition or multiplication, e.g., qZ2 and qZ7. 
Subtraction and division have much lower heritability, and many more QTLs 
detected by component traits were not detected by either of them. 

In conclusion, the use of composite traits in genetic studies increases the gene 
number, introduces higher-order gene interactions than those observed in component 
traits, and possibly complicates the linkage relationship between QTLs as well. The 
increased complexity in genetic architecture associated with the composite traits is 
responsible for the reduced detection power and the increased FDR. Composite-only 
QTLs identified in practical mapping populations can be explained either as 
minor-effect QTLs that are not detected by component traits, or as false 
positives. 

In breeding, an index can be built to combine the information available on 
multiple traits and then used in selecting the optimum individuals or families 
(Bernardo, 2010; Falconer and Mackay, 1996; Baker, 1986). Different indices, such as 
optimum index, base index, and multiplicative index, have been proposed 
(Bernardo, 2010), which can also be treated as composite traits with much more 
complicated genetic architecture than the component traits from which the index is 
built. Few genetic studies have been conducted on indices, but this does not deter 
their use in breeding. In fact, genetic studies and breeding have objectives that are 
different, but not mutually contradictory in the broad sense. It is the breeders’ 
objective to combine as many favorable genes as possible. The use of composite 
traits or indices is efficient for selecting multiple favorable genes simultaneously. In 
contrast, geneticists have the objective of studying individual genes. For this purpose, 
the use of component traits may be more efficient, since the fewer genes 
are involved, the easier it is to investigate and dissect them properly. 

Given that composite traits are not really suitable for QTL mapping, how 
can we learn the genetic information on composite traits? In fact, the information 
on composite traits can be roughly deduced from the component traits. For example, 
when one QTL has genetic effects in the same direction on both component traits, 
this QTL will affect the composite addition. When one QTL has genetic effects in 
opposite directions on the two component traits, this QTL will affect the composite 
subtraction. It is hard to say whether a QTL affecting addition will also affect the composite 
multiplication, or whether a QTL affecting subtraction will also affect the composite division. To 
work out the genetic factors on multiplication and division, one may consider using 
the logarithm transformation, and treating multiplication and division as addition 
and subtraction, respectively. But further investigation may be needed. 
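
The idea behind the logarithm transformation can be written out as a short sketch. It assumes that both component traits take strictly positive values; the symbols $P_{\mathrm{I}}$ and $P_{\mathrm{II}}$ for the phenotypic values of the two component traits are introduced here for illustration only.

```latex
\begin{aligned}
\log\left(P_{\mathrm{I}} \times P_{\mathrm{II}}\right) &= \log P_{\mathrm{I}} + \log P_{\mathrm{II}}, \\
\log\left(P_{\mathrm{I}} / P_{\mathrm{II}}\right) &= \log P_{\mathrm{I}} - \log P_{\mathrm{II}}.
\end{aligned}
```

On the log scale, the product and the ratio thus behave like the addition and subtraction of two transformed component traits. Whether QTL effects that are additive on the original scale remain approximately additive after the transformation is a separate question.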


10.3 Effects on QTL Detection by the Increase in Marker 
Density 


Two approaches can be considered in order to increase the power of statistical 
tests, i.e., the increase in sample size and the reduction in random errors. In 
practice, the size of a genetic population can hardly be increased once 
developed. An increase in replications in phenotyping trials can reduce the error 
effects included in phenotypic means, increase the trait heritability based on the 
replicated means, and finally increase the QTL detection power. In the past, 
genotypic screening was conducted only for tens to hundreds of polymorphic 
markers, such as restriction fragment length polymorphisms (RFLP) and simple 
sequence repeats (SSR). With the fast development of molecular marker technologies, 
the number of markers that can be used in genotyping has increased 
significantly in the past two decades. One practical question arises, i.e., whether 
QTLs can be more accurately mapped by using denser markers in an existing 
mapping population (Li et al., 2010; Piepho, 2000). In the genome-wide association 
studies (GWAS) introduced in §9.5, denser markers can help to identify the 
residual linkage disequilibrium remaining in natural populations, and therefore 
increase the efficiency of QTL detection. This section will focus on the effects of 
denser markers in QTL linkage mapping studies. 


10.3.1 Effects of Denser Markers on Independent QTLs 


Ten chromosomes are considered in the simulation study, each 160 cM in length.
Three marker densities are considered, i.e., 5, 10, and 20 cM. Markers are assumed
to be evenly distributed on chromosomes, so the total marker numbers are 330, 170,
and 90 for the three densities, respectively. Eight QTLs with different levels of PVE,
i.e., 1%, 2%, 3%, 4%, 5%, 10%, 20%, and 30%, are assumed to be located at 22 cM
on different chromosomes. No linkage is considered between QTLs. RIL populations
are simulated with sizes from 20 to 500 in steps of 20. The support interval in
power analysis is 10 cM in length. Namely, a QTL detected in a simulated
population within the interval from 17 to 27 cM on the chromosome is counted as a
true positive. QTLs detected outside the support intervals are counted as false
positives. Each size of the RIL population is simulated 1000 times (see Li et al.
(2010) for more details). Figure 10.7 gives the QTL detection powers and false
discovery rates (FDR) for various marker densities and population sizes, where ICIM
is the mapping method. Generally speaking, QTL detection power increases and FDR
decreases as population size rises (figure 10.7), indicating the utmost importance of
using large populations in genetic studies.
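
The bookkeeping behind this simulation design can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, a single QTL on a single chromosome is considered, and power is taken as the fraction of simulated populations with at least one detection inside the support interval, while FDR is taken as the fraction of all detections falling outside it. The exact counting rules used in Li et al. (2010) may differ in detail.

```python
CHROMOSOMES = 10
CHROM_LENGTH_CM = 160
QTL_POSITION_CM = 22
SUPPORT_HALF_WIDTH_CM = 5  # 10 cM support interval, i.e., 17 to 27 cM


def total_markers(spacing_cm):
    """Number of evenly spaced markers over all chromosomes (both ends included)."""
    per_chromosome = CHROM_LENGTH_CM // spacing_cm + 1
    return CHROMOSOMES * per_chromosome


def is_true_positive(detected_cm):
    """A detected position counts as a true positive inside the support interval."""
    return abs(detected_cm - QTL_POSITION_CM) <= SUPPORT_HALF_WIDTH_CM


def power_and_fdr(runs):
    """runs: one list of detected positions (cM) per simulated population."""
    hits = sum(any(is_true_positive(p) for p in run) for run in runs)
    detected = sum(len(run) for run in runs)
    false = sum(not is_true_positive(p) for run in runs for p in run)
    power = hits / len(runs)
    fdr = false / detected if detected else 0.0
    return power, fdr


print([total_markers(s) for s in (5, 10, 20)])
print(power_and_fdr([[21], [30], [], [17, 60]]))
```

The first line printed reproduces the marker numbers quoted above; the second uses four made-up detection lists purely to show how the two rates are counted.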

For QTLs explaining 2% or more of the phenotypic variation, detection powers 
are similar for marker densities 5 and 10 cM when population size is fixed, but the 
detection power is reduced for density 20 cM. For QTLs with PVEs equal to or 
greater than 10%, detection power can reach 100% when the population size exceeds 
200 (figure 10.7) even for marker density 20 cM. The largest difference in detection 
power is observed for QTL with the least PVE, no matter whether the population 
size is small or large. The power from marker density 20 cM is consistently lower 
than those from the other two densities, especially for QTL with medium-sized 
genetic effects, i.e., QTLs explaining 3%-10% of the phenotypic variation, indi- 
cating that marker density 20 cM may be too sparse in QTL mapping when the 
target is to identify QTL with medium-to-large genetic effects. 

Increasing marker density can benefit the detection of QTLs with smaller effects,
but FDR may become higher at the same time (figure 10.7). Therefore, when using a
higher density of markers to increase the QTL detection power, it should be kept in
mind that the advantages of denser markers are realized more efficiently in large
mapping populations. Once marker density has reached a given level, e.g., one
marker every 10 cM in the genome, increasing population size has a greater effect
than adding markers, especially for QTLs with small-to-medium effects.


[Figure 10.7: detection power or FDR (%) plotted against population size; one panel
is titled "Marker density 20 cM".]


FIG. 10.7 – Effects of marker density and population size on QTL detection power and false
discovery rate (FDR).


10.3.2 Effect of Denser Markers on Linked QTLs 


Genetic linkage between QTLs affecting the phenotypic trait of interest complicates
genetic studies on the trait, but has to be faced in some situations. From the
linear model introduced in chapter 5, it is known that the position and effect
information of one QTL can be completely absorbed by the two most closely
linked markers on both sides of the QTL, i.e., the flanking markers. During
one-dimensional scanning for the existence of QTL in a given marker interval, the
effects of QTLs located in other intervals on the same chromosome or on other
chromosomes can be controlled by adjusting the phenotypic values with the estimated
linear model. The adjusted values only contain the information on the QTL
located in the current scanning interval, and therefore the background genetic
variation outside the current interval is controlled. However, if one marker is linked
with two QTLs, one on each side of the marker locus, the adjusted phenotypic values
will contain the information from both QTLs, and the two linked QTLs cannot be
properly separated. As stated in previous chapters, linked QTLs have to be
isolated by at least one empty marker interval (Li et al., 2007; Whittaker et al.,
1996). An empty interval is a marker interval in which no QTL is located. Obviously,
an increase in marker density may introduce empty marker intervals on the linkage
map, make non-isolated QTLs become isolated, and finally allow the linked QTLs to
be separated properly.

Figure 10.8 is a schematic representation of two QTLs (i.e., Q1 and Q2) located
at 22 and 42 cM on one chromosome under three marker densities. For marker
density 20 cM, Q1 and Q2 are located in two neighboring intervals and therefore are
not isolated (figure 10.8A). In the linear regression model of phenotype on marker
types, the effect of Q1 will be assigned to marker 4 and marker 8; the effect of Q2 will
be assigned to marker 8 and marker 12. Marker 8 is affected by both QTLs. While
scanning for QTL in the interval defined by marker 4 and marker 8, a partial effect
from Q2 is still retained in the adjusted phenotypic values. While scanning for QTL
in the interval defined by marker 8 and marker 12, a partial effect from Q1 is still
retained in the adjusted phenotypic values. The LOD score reflects the confounded
effects of both QTLs. High LOD scores would be observed in the two neighboring
marker intervals if the linkage phase is coupling; low LOD scores would be observed
if the linkage phase is repulsion. The two linked QTLs cannot be correctly mapped in
either situation.

Under marker density 10 cM, Q1 and Q2 are isolated by one empty interval
(figure 10.8B). In the linear regression model of phenotype on marker types, the
effect of Q1 will be assigned to marker 4 and marker 6; the effect of Q2 will be
assigned to marker 8 and marker 10. While scanning for QTL in the interval defined
by marker 4 and marker 6, the adjusted phenotypic values only contain the
information from Q1. While scanning for QTL in the interval defined by marker 8 and
marker 10, the adjusted phenotypic values only contain the information from Q2.
However, while scanning for QTL in the interval defined by marker 6 and marker 8,
the adjusted phenotypic values still contain partial effects from both QTLs. High
LOD scores would be observed in the empty interval, and therefore a false positive
QTL may occur.


[Figure 10.8: three panels, each showing one chromosome with its marker positions
and the two QTLs Q1 and Q2. (A) Marker density 20 cM, markers 0, 4, 8, ..., 32.
(B) Marker density 10 cM, markers 0, 2, 4, ..., 32. (C) Marker density 5 cM,
markers 0, 1, 2, ..., 32.]


FIG. 10.8 – Schematic representation of two linked QTLs 20 cM apart on one chromosome
under three marker densities, i.e., 20 cM (A), 10 cM (B), and 5 cM (C). Notes: The two linked
QTLs are located at 22 cM and 42 cM on the chromosome. For convenience, markers at
density 5 cM are named by numbers from 0 to 32. These names are consistently used for
density 10 cM and density 20 cM in the figure.

Under marker density 5 cM, Q1 and Q2 are isolated by three empty intervals
(figure 10.8C). In the linear regression model of phenotype on marker types, the
effect of Q1 will be assigned to marker 4 and marker 5; the effect of Q2 will be
assigned to marker 8 and marker 9. While scanning for QTL in the interval
defined by marker 4 and marker 5, the adjusted phenotypic values only contain
the information from Q1. While scanning for QTL in the interval defined by
marker 8 and marker 9, the adjusted phenotypic values only contain the
information from Q2. While scanning for QTL in the interval defined by marker 5 and
marker 6, the adjusted phenotypic values still contain a partial effect from Q1, and
high LOD scores would be observed; while scanning for QTL in the interval
defined by marker 7 and marker 8, the adjusted phenotypic values still contain
a partial effect from Q2, and high LOD scores could be observed as well. However,
while scanning for QTL in the interval defined by marker 6 and marker 7, the
adjusted phenotypic values do not contain any effect from either QTL, and the
LOD score in the interval would be close to zero. Therefore, the high LOD scores
present in intervals 5-6 and 7-8 may not cause any problem in locating the two
linked QTLs properly.
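
The isolation condition discussed for the three densities can be checked with a short Python sketch. It is an illustration only: the helper names are hypothetical, markers are assumed to be evenly spaced starting at 0 cM, and a QTL sitting exactly on a marker is arbitrarily assigned to the interval on its right.

```python
def interval_index(position_cm, spacing_cm):
    """Index k of the marker interval [k * spacing, (k + 1) * spacing) holding a position."""
    return int(position_cm // spacing_cm)


def empty_intervals_between(qtl1_cm, qtl2_cm, spacing_cm):
    """Number of QTL-free marker intervals lying between two QTLs."""
    first, second = sorted(
        (interval_index(qtl1_cm, spacing_cm), interval_index(qtl2_cm, spacing_cm))
    )
    return max(second - first - 1, 0)


for spacing in (20, 10, 5):
    n_empty = empty_intervals_between(22, 42, spacing)
    status = "isolated" if n_empty >= 1 else "not isolated"
    print(f"{spacing} cM: {n_empty} empty interval(s), {status}")
```

For QTLs at 22 and 42 cM, the loop reports zero, one, and three empty intervals at densities 20, 10, and 5 cM, matching the three panels of figure 10.8.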

Figure 10.9 gives the average profiles of LOD score and additive effect for two 
population sizes (i.e., 100 and 500) and three marker densities (i.e., 20, 10, and 
5 cM). For each density and size, 100 RIL populations are simulated and ICIM is 
the mapping method. Two linked QTLs are located at 22 and 42 cM, the same as 
those shown in figure 10.8. Both additive effects are equal to 1, and the 
broad-sense heritability is equal to 50%. When marker density is 20 cM, the two 
linked QTLs cannot be separated by ICIM even for a population size 500. Instead, 
one ghost QTL is always observed between the two true QTL positions 
(figure 10.9). When marker density is 10 cM, two QTLs are detected only in a 
small number of simulated populations. Ghost QTLs are present with much higher 
additive effects in most simulated populations. When marker density is 5 cM, two 
QTLs are detected in most simulated populations, and their estimated effects are 
close to the pre-defined values. 

Figure 10.9 clearly indicates that linked QTLs cannot be separated if no empty
interval occurs between them, or in the extreme situation when two QTLs are
present in one marker interval. Under these situations, the LOD score reflects the
cumulative effect of both QTLs, and an increase in population size alone helps
little. Instead, additional markers should be used in genotyping so that the linked
QTLs become isolated.


FIG. 10.9 – Average profiles of LOD score and additive effect from ICIM under two
population sizes (i.e., 100 and 500) and three marker densities (i.e., 20, 10, and 5 cM). Notes: For
each density and size, 100 RIL populations are simulated and used in QTL mapping. Two
linked QTLs are located at 22 and 42 cM as shown in figure 10.8. Both additive effects are
equal to 1, and the broad-sense heritability is 50%.


10.4 Imputation of Missing Marker Types and Their 
Effects in QTL Mapping in Bi-Parental Populations 


10.4.1 Imputation of Missing and Incomplete Marker 
Types 


Missing marker types are common in most genetic populations for various reasons.
In the genetic populations introduced in chapters 7 and 8, incomplete marker types
are common as well. As far as bi-parental populations are concerned, a
co-dominant locus can be treated as complete, but dominant and recessive loci
are incomplete. Given the linkage relationship between missing and
non-missing markers, or between incomplete and complete markers, the probabilities
of the complete types included in the missing and incomplete types can be determined
and then used to impute the missing and incomplete types. The imputation algorithm
has been introduced in §7.3 for double cross F1 populations, in §8.3 for four-parental
pure-line populations, and in §8.4 for eight-parental pure-line populations.

Given below is the imputation algorithm for bi-parental populations, taking the
F2 population as an example, which is also used to illustrate the effect of missing
markers in QTL mapping (for more details see Zhang et al. (2010)). As given in
table 1.2, A, H, B, and X are the valid genotypic values for co-dominant markers in
F2 populations in the QTL IciMapping software. Missing value X will be imputed
according to the probabilities of being A, H, and B. D, B, and X are valid genotypic
values for dominant markers. Incomplete type D will be imputed according to the
probabilities of being A and H, and missing value X according to the probabilities of
being A, H, and B. A, R, and X are valid genotypic values for recessive markers.
Incomplete type R will be imputed according to the probabilities of being H and B,
and missing value X according to the probabilities of being A, H, and B. After
imputation, all markers belong to the co-dominant category with complete types A,
H, and B. This is the reason why dominant, recessive, and missing marker types need
not be considered separately in QTL mapping in bi-parental populations.

Assume the linkage map has been constructed by the linkage analysis introduced in
chapter 2 and the map construction algorithms introduced in chapter 3. Let
M (from P1) and m (from P2) be the two alleles at the missing marker locus M, and
let z be a random number drawn from the uniform distribution U(0, 1). Imputation is
performed following the order of markers on each chromosome, and thus all marker
loci before the current one to be imputed are treated as co-dominant without any
missing types. Three situations are considered in imputation, according to the
number of co-dominant markers that are linked with the current marker to be imputed.


1. No linkage information can be utilized 


The current missing marker is not linked with any co-dominant marker, either
on its left side or on its right side. In this case, no linkage information can be used,
and the genotype for the missing marker type is imputed by the Mendelian frequencies
of the three genotypes, i.e., 0.25, 0.50, and 0.25. That is to say, MM is assigned to
missing type X if z < 0.25; Mm is assigned if z < 0.75; otherwise, mm is assigned. If
locus M is dominant, MM and Mm included in the incomplete type D follow the
ratio 0.25:0.5 = 1:2. If locus M is recessive, Mm and mm included in the incomplete
type R follow the ratio 0.5:0.25 = 2:1. The two ratios can be used to impute the
incomplete types D and R at dominant and recessive markers, respectively.


2. One linked co-dominant marker can be utilized 


There is one co-dominant locus without any missing values that is linked with the
current missing marker. Assume that A and a are the two alleles at the linked
co-dominant locus A, and the recombination frequency is r between locus M and
locus A. Table 10.11 gives the probabilities of genotypes MM, Mm, and mm under
each genotype at locus A. For example, if one individual has marker type AA at
locus A, the conditional probabilities of genotypes MM, Mm, and mm at locus M
are $(1-r)^2$, $2r(1-r)$, and $r^2$, respectively. Therefore, MM is assigned to missing
type X if $z < (1-r)^2$; Mm is assigned if $z < (1-r)^2 + 2r(1-r)$; otherwise, mm is
assigned. When r is much lower than 0.5, $(1-r)^2$ would be much higher than 0.25.
Obviously, the missing type X in one F2 individual having genotype AA at
complete locus A would be more likely to be imputed as MM. If locus M is dominant,
MM and Mm included in incomplete type D follow the ratio
$(1-r)^2 : 2r(1-r) = (1-r) : 2r$. If locus M is recessive, Mm and mm included in
incomplete type R follow the ratio $2r(1-r) : r^2 = 2(1-r) : r$. The two ratios can be
used to impute the dominant and recessive markers, respectively, for F2 individuals
having genotype AA at locus A. For individuals having marker type Aa or aa at locus A,
missing and incomplete types can be imputed in a similar way.


TAB. 10.11 – Conditional probability of missing marker types to be imputed on one linked
locus with non-missing and complete genotypes.


Non-missing and         Missing genotypes to be imputed
complete genotypes      MM             Mm               mm

AA                      $(1-r)^2$      $2r(1-r)$        $r^2$

Aa                      $r(1-r)$       $1-2r+2r^2$      $r(1-r)$

aa                      $r^2$          $2r(1-r)$        $(1-r)^2$


Note: r is the recombination frequency between one missing marker and one co-dominant
locus without any missing types.
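
The imputation rule for one linked co-dominant marker can be sketched as follows. This is a minimal illustration, not the QTL IciMapping implementation: the function names and the string coding of genotypes are assumptions made here. The probabilities are those of table 10.11, and the incomplete types D and R are handled by renormalizing over the genotypes they may contain, which is equivalent to the ratios given in the text. Setting r = 0.5 recovers the Mendelian frequencies used when no linkage information is available.

```python
import random


def conditional_probs(linked_genotype, r):
    """(P(MM), P(Mm), P(mm)) given the genotype at the linked locus A (table 10.11)."""
    table = {
        "AA": ((1 - r) ** 2, 2 * r * (1 - r), r ** 2),
        "Aa": (r * (1 - r), 1 - 2 * r + 2 * r ** 2, r * (1 - r)),
        "aa": (r ** 2, 2 * r * (1 - r), (1 - r) ** 2),
    }
    return table[linked_genotype]


# Complete genotypes compatible with each observed marker type.
COMPATIBLE = {"X": ("MM", "Mm", "mm"), "D": ("MM", "Mm"), "R": ("Mm", "mm")}


def impute(observed, linked_genotype, r, z):
    """Impute type X, D, or R using a uniform random number z in [0, 1)."""
    probs = dict(zip(("MM", "Mm", "mm"), conditional_probs(linked_genotype, r)))
    candidates = COMPATIBLE[observed]
    total = sum(probs[g] for g in candidates)
    cumulative = 0.0
    for genotype in candidates:
        cumulative += probs[genotype] / total
        if z < cumulative:
            return genotype
    return candidates[-1]  # guards against rounding when z is very close to 1


print(impute("X", "AA", 0.1, random.random()))
```

With r = 0.1 and genotype AA at the linked locus, the missing type is imputed as MM whenever z < 0.81, which illustrates why linkage makes MM the most likely outcome.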


3. Two flanking co-dominant markers can be utilized 


Two co-dominant loci (i.e., locus A and locus B) without any missing values are
linked with the current missing marker M, one on each side. Assume $r_1$, $r_2$, and
$r$ are the recombination frequencies between locus A and locus M, locus M and
locus B, and locus A and locus B, respectively. Table 10.12 gives the probabilities of
genotypes MM, Mm, and mm under each joint genotype at locus A and locus B.
Assume one F2 individual has genotype AA at locus A and BB at locus B. Genotype
MM is assigned to missing type X if

$$z < \frac{(1-r_1)^2(1-r_2)^2}{(1-r)^2};$$

Mm is assigned if

$$z < \frac{(1-r_1)^2(1-r_2)^2 + 2r_1r_2(1-r_1)(1-r_2)}{(1-r)^2};$$

otherwise, mm is assigned. Obviously, missing type X in one F2 individual having
genotype AABB would be much more likely to be imputed as MM. If locus M is
dominant, MM and Mm included in incomplete type D follow the ratio

$$\frac{(1-r_1)^2(1-r_2)^2}{(1-r)^2} : \frac{2r_1r_2(1-r_1)(1-r_2)}{(1-r)^2} = (1-r_1)(1-r_2) : 2r_1r_2.$$

If locus M is recessive, Mm and mm included in incomplete type R follow the ratio

$$\frac{2r_1r_2(1-r_1)(1-r_2)}{(1-r)^2} : \frac{r_1^2r_2^2}{(1-r)^2} = 2(1-r_1)(1-r_2) : r_1r_2.$$

The two ratios can be used to impute the dominant and recessive markers,
respectively, for F2 individuals having genotype AABB at locus A and locus B. For
individuals having other marker types at locus A and locus B, missing and incomplete
types can be imputed in a similar way.


TAB. 10.12 – Conditional probability of missing marker types to be imputed by using two
flanking loci which are co-dominant and have no missing genotypes.


Non-missing   Missing genotypes to be imputed
genotypes     MM                                          Mm                                                  mm

AABB          $\frac{(1-r_1)^2(1-r_2)^2}{(1-r)^2}$        $\frac{2r_1(1-r_1)r_2(1-r_2)}{(1-r)^2}$             $\frac{r_1^2r_2^2}{(1-r)^2}$

AABb          $\frac{(1-r_1)^2r_2(1-r_2)}{r(1-r)}$        $\frac{r_1(1-r_1)(1-2r_2+2r_2^2)}{r(1-r)}$          $\frac{r_1^2r_2(1-r_2)}{r(1-r)}$

AAbb          $\frac{(1-r_1)^2r_2^2}{r^2}$                $\frac{2r_1(1-r_1)r_2(1-r_2)}{r^2}$                 $\frac{r_1^2(1-r_2)^2}{r^2}$

AaBB          $\frac{r_1(1-r_1)(1-r_2)^2}{r(1-r)}$        $\frac{(1-2r_1+2r_1^2)r_2(1-r_2)}{r(1-r)}$          $\frac{r_1(1-r_1)r_2^2}{r(1-r)}$

AaBb          $\frac{2r_1(1-r_1)r_2(1-r_2)}{1-2r+2r^2}$   $\frac{(1-2r_1+2r_1^2)(1-2r_2+2r_2^2)}{1-2r+2r^2}$  $\frac{2r_1(1-r_1)r_2(1-r_2)}{1-2r+2r^2}$

Aabb          $\frac{r_1(1-r_1)r_2^2}{r(1-r)}$            $\frac{(1-2r_1+2r_1^2)r_2(1-r_2)}{r(1-r)}$          $\frac{r_1(1-r_1)(1-r_2)^2}{r(1-r)}$

aaBB          $\frac{r_1^2(1-r_2)^2}{r^2}$                $\frac{2r_1(1-r_1)r_2(1-r_2)}{r^2}$                 $\frac{(1-r_1)^2r_2^2}{r^2}$

aaBb          $\frac{r_1^2r_2(1-r_2)}{r(1-r)}$            $\frac{r_1(1-r_1)(1-2r_2+2r_2^2)}{r(1-r)}$          $\frac{(1-r_1)^2r_2(1-r_2)}{r(1-r)}$

aabb          $\frac{r_1^2r_2^2}{(1-r)^2}$                $\frac{2r_1(1-r_1)r_2(1-r_2)}{(1-r)^2}$             $\frac{(1-r_1)^2(1-r_2)^2}{(1-r)^2}$


Notes: the missing marker M is located between two co-dominant loci A and B without any
missing values; $r_1$, $r_2$, and $r$ are recombination frequencies between locus A and locus M,
between locus M and locus B, and between locus A and locus B, respectively. No
interference is assumed, i.e., $r = r_1 + r_2 - 2r_1r_2$.
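
The entries of table 10.12 can be cross-checked numerically by enumerating the gametes of the F1 hybrid. The sketch below is an illustration under the same no-interference assumption; the function names and the genotype coding (strings such as "AA" and "Bb") are hypothetical. It computes the conditional probabilities by brute force rather than from the closed-form expressions, so that the two can be compared.

```python
from itertools import product


def gamete_probs(r1, r2):
    """Haplotype probabilities at loci A, M, B (1 = P1 allele, 0 = P2 allele)."""
    probs = {}
    for a, m, b in product((0, 1), repeat=3):
        p = 0.5                             # either parental allele at locus A
        p *= r1 if a != m else 1 - r1       # recombination between A and M
        p *= r2 if m != b else 1 - r2       # recombination between M and B
        probs[(a, m, b)] = p
    return probs


def flanking_probs(geno_a, geno_b, r1, r2):
    """(P(MM), P(Mm), P(mm)) given the genotypes at flanking loci A and B."""
    n_a, n_b = geno_a.count("A"), geno_b.count("B")
    gametes = gamete_probs(r1, r2)
    joint = [0.0, 0.0, 0.0]                 # indexed by the number of M alleles
    for (h1, p1), (h2, p2) in product(gametes.items(), repeat=2):
        if h1[0] + h2[0] == n_a and h1[2] + h2[2] == n_b:
            joint[h1[1] + h2[1]] += p1 * p2
    total = sum(joint)
    return joint[2] / total, joint[1] / total, joint[0] / total


print(flanking_probs("AA", "BB", 0.1, 0.2))
```

Because the double heterozygote AaBb mixes the coupling and repulsion phases, enumerating gamete pairs is a convenient way to confirm that row without tracking the phases by hand.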


10.4.2 QTLs on Plant Height in an F2 Population in Rice


One F2 population in rice consists of 180 individuals from the cross made between
indica rice variety PA64s and japonica rice variety Nipponbare (Zhang et al., 2010;
Ye et al., 2005). Nipponbare was completely sequenced in 2002, and PA64s was
partially sequenced in the same year. A total of 137 polymorphic SSR markers
were screened in the F2 population, and the linkage map covered the 12 rice
chromosomes, each carrying 6-12 markers, with an average density of 17.1 cM.
There is a total of 24 660 (i.e., 180 × 137) marker data points, among which 5131
are the parental type of PA64s, 6175 are the parental type of Nipponbare, 11 114 are
heterozygous, and 2240 are missing; the missing points account for 9.08% of the
total marker data points.

Parent PA64s carries one dwarfing gene and has a plant height of 74.4 cm. The plant
height of parent Nipponbare is 98.3 cm. ICIM is used to detect the QTLs affecting
plant height, where the probabilities for marker variables to enter and leave the
phenotypic linear model are set at 0.01 and 0.02, respectively. The LOD score
threshold is 3.0, and the scanning step is 1 cM. Detailed information on the nine
identified QTLs is given in table 10.13. Two plant-height QTLs are located on each
of chromosomes 1 and 3, and one QTL is located on each of chromosomes 4, 5, 6, 7,
and 12. qPH1-2, located between markers RP82 and RP3 on chromosome 1, has the
largest PVE (i.e., 25.57%) and is approximately additive, i.e., the estimated additive
effect is -8.59, and the dominant effect is 0.59, close to zero.


TAB. 10.13 – QTLs on plant height identified in an F2 population in rice.


QTL      Marker        Distance to left   Additive      Dominant      LOD     PVE     Degree of
name     interval      marker (cM)        effect (cm)   effect (cm)   score   (%)     dominance

qPH1-1   RM246-RP2     12.0               -0.57         -7.98          8.04   12.03   13.96
qPH1-2   RP82-RP3      19.5               -8.59          0.59         15.54   25.57   -0.07
qPH3-1   RM523-RM251   16.9                4.35         -4.86          6.51   13.30   -1.12
qPH3-2   RP242-RM520   11.4               -4.69         -1.00          5.04    6.84    0.21
qPH4     RP67-08R15    13.7               -3.56         -2.09          4.61    5.53    0.59
qPH5     RM159-RP299   13.0               -0.44         -4.48          3.13    3.86   10.24
qPH6     RP199-RM276    6.2               -0.79         -5.05          3.17    4.96    6.36
qPH7     RM82-RM180     7.0                0.26          6.48          5.27    7.56   25.24
qPH12    RM19-RM247     2.4               -1.66          3.93          3.98    5.44   -2.36

In the population, most F2 individuals are taller than the shorter parent PA64s,
indicating that PA64s carries most of the alleles reducing plant height. F2 individuals
taller than the taller parent Nipponbare are also observed, indicating the presence
of over-dominance at least at some QTLs. It can be seen from table 10.13 that seven
QTLs have negative additive effects, indicating that the alleles reducing plant height
at these loci are harbored in parent PA64s. qPH1-1, qPH3-1, qPH5, qPH6, qPH7,
and qPH12 are over-dominant QTLs, which explains the presence of individuals taller
than the taller parent. qPH1-1, qPH5, and qPH7 have minor additive effects, which
may be hard to detect if a RIL population is used instead. Therefore, the
mapping results for one phenotypic trait can differ considerably between different
types of bi-parental populations, even those derived from the same two parental lines.
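
The classification of the nine QTLs can be reproduced from the estimated effects. The sketch below is illustrative: the additive and dominant effects are copied from table 10.13, the variable names are hypothetical, and a QTL is labeled over-dominant when the absolute degree of dominance |d/a| exceeds 1, which is the usual convention and is assumed here. Ratios recomputed from the rounded effects can differ slightly from the tabulated degrees of dominance.

```python
# (additive effect a, dominant effect d) in cm, copied from table 10.13
QTL_EFFECTS = {
    "qPH1-1": (-0.57, -7.98),
    "qPH1-2": (-8.59, 0.59),
    "qPH3-1": (4.35, -4.86),
    "qPH3-2": (-4.69, -1.00),
    "qPH4": (-3.56, -2.09),
    "qPH5": (-0.44, -4.48),
    "qPH6": (-0.79, -5.05),
    "qPH7": (0.26, 6.48),
    "qPH12": (-1.66, 3.93),
}

# QTLs whose height-reducing allele comes from PA64s (negative additive effect).
negative_additive = [name for name, (a, d) in QTL_EFFECTS.items() if a < 0]

# QTLs with |d/a| > 1, i.e., over-dominance under the assumed convention.
over_dominant = [name for name, (a, d) in QTL_EFFECTS.items() if abs(d / a) > 1]

print(len(negative_additive), "QTLs with negative additive effects")
print("Over-dominant:", ", ".join(over_dominant))
```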


10.4.3 Effects of Missing Marker Types on QTL Detection 


In the simulation experiment, the linkage map of the actual rice F2 population and
the QTLs identified for plant height are used, and two levels of population size, i.e.,
180 (same as the actual mapping population) and 500, are considered. QTL detection
power and FDR are shown in figure 10.10 for seven missing levels, i.e., 0 (no missing 
markers), 5%, 10%, 15%, 20%, 25%, and 30%. No segregation distortion is included. 
Obviously, the more missing markers, the lower the QTL detection power would be 
regardless of the size of simulated populations. However, missing markers have 
greater effects on smaller-effect QTLs and smaller-sized populations. When the size 
of simulated populations is 180, same as size of the actual population, detection 
powers at the seven missing levels are 93.5%, 92.7%, 91.8%, 89.8%, 87.9%, 84.7%, 
and 84.7% for qPH1-2 (having the largest PVE at 25.57%), and 9.5%, 8.8%, 7.7%, 
6.3%, 5.4%, 4.1%, and 3.5% for qPH5 (having the smallest PVE at 3.86%), 
respectively (figure 10.10A). When the size of simulated populations is 500, little 
reduction in power is observed for qPH1-2 by the increase in missing markers 
(figure 10.10B). In comparison, the detection power of qPH5 is decreased from 
48.6% to 40.8%, 33.8%, 28.7%, 24.8%, 20.7%, and 15.1%, as the missing level 
increases from 0 to 5%-30% (figure 10.10B). 

Missing markers increase the false discovery rate (FDR) (figure 10.10) for both 
population sizes. For the seven missing levels, the respective FDR values are 38.6%, 
39.5%, 41.7%, 41.4%, 43.3%, 45.2%, and 45.3% when the size of simulated popu- 
lations is equal to 180, and 17.3%, 18.5%, 19.8%, 21.1%, 21.8%, 22.8%, and 24.9% 
when the size of simulated populations is equal to 500. Even though the missing
markers reduce the QTL detection power and increase FDR, the QTL positions and
effects are similar to those estimated from the non-missing markers (Zhang et al.,
2010).


FIG. 10.10 – Effect of missing marker types (i.e., 0, 5, 10, 15, 20, 25, and 30%) on QTL
detection power and FDR in the F2 populations with two sizes, i.e., 180 (A) and 500 (B).


In another simulation experiment, the linkage map and the pre-defined QTLs are
the same as before. No missing markers are considered, but the population size is
reduced by 5%, 10%, 15%, 20%, 25%, and 30%, each from sizes 180 and 500. QTL
detection power and FDR are shown in figure 10.11, which shows a pattern similar
to that in figure 10.10. For example, detection powers of qPH1-1 for population
size 180 at missing levels from 5% to 30% are 78.9%, 75.5%, 73.1%, 70.0%,
67.8%, and 66.3%, respectively (figure 10.10A); for population sizes 5% to 30%
smaller than 180 (i.e., 171, 162, 153, 144, 135, and 126) but with no missing markers,
detection powers of qPH1-1 are 78.3%, 77.3%, 73.6%, 70.9%, 66.0%, and 66.8%,
respectively (figure 10.11A). That is to say, the detection power of QTL in a
population with size n and missing marker level p is similar to that in a population
with size n(1 - p) but no missing markers.
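
The rule of thumb n(1 - p) is simple arithmetic, shown here as a small Python helper. The function name is hypothetical, and the result is only the approximate equivalence suggested by the simulations, not an exact relationship.

```python
def equivalent_size(n, missing_percent):
    """Size of a population without missing markers that has similar detection power."""
    return round(n * (100 - missing_percent) / 100)


levels = (5, 10, 15, 20, 25, 30)
print([equivalent_size(180, p) for p in levels])
print([equivalent_size(500, p) for p in levels])
```

For n = 180 the helper returns the six reduced sizes used in the second simulation experiment.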

Missing markers reduce the QTL detection power and increase the false discovery
rate. The influence of missing genotypes is stronger for QTLs with smaller effects
and in smaller populations. The effect of missing markers can be quantified by the
reduced size of a mapping population in which no missing markers are present. Very
likely, this conclusion can be extended to other mapping populations, such as those
introduced in chapters 7 and 8. Incomplete markers are expected to have some
negative effects on QTL mapping as well, even though these effects can hardly be
quantified.

Most QTL mapping methods and software packages allow missing marker types
in the mapping population, but missing values should still be kept to a minimum
so as to acquire more accurate results. Imputation is just one technical
approach adopted to simplify the analytical methods in genetic studies. It is
impossible to completely recover the linkage or other genetic information lost through
missing markers. If no additional information can be used, imputation alone
has no effect on QTL detection, either positive or negative. However, when
genotyping is repeated for individuals or families with missing marker types,
additional information is provided. It can be expected that the missing rate
would be reduced, and the QTL detection power would be increased.


FIG. 10.11 – Effect of the reduced population size (i.e., 0, 5, 10, 15, 20, 25, and 30%) on QTL
detection power and FDR in F2 populations without any missing marker types. Notes:
(A) Population size is reduced by 5, 10, 15, 20, 25, and 30% from 180; (B) Population size is
reduced by 5, 10, 15, 20, 25, and 30% from 500.


10.5 Effects of Segregation Distortion on Genetic Studies 


10.5.1 Segregation Distortion Loci in One Rice
F2 Population


Imputation of missing and incomplete marker types and their effects on QTL
detection were introduced in the previous section. Segregation distortion is another
commonly encountered phenomenon in genetic populations. Segregation distortion
loci (SDL) cause the adjacent markers to deviate from the expected Mendelian
segregation ratio. Markers showing segregation distortion are called segregation
distortion markers (SDM), or simply distorted markers. Segregation distortion is
always related to a sterile gene or chromosome translocation, which has been widely
investigated in genetics (Xu, 2008; Zhu et al., 2007; Luo et al., 2005; Hackett and
Broadfoot, 2003; Lorieux et al., 1995; Hedrick and Muona, 1990). In the actual F2
population used in §10.4, the significance test against the Mendelian
ratio 1:2:1 identified nine SDMs, distributed on chromosomes 2, 3, 5, 8, 10, 11, and
12 (Zhang et al., 2010). The significance level is set at 0.01 for individual tests. In
table 10.14, M:m represents the ratio of the two parental alleles in the F2 population,
and P is the significance probability (P value) of the distortion test. The largest value
of the test statistic, represented by the negative logarithm of the significance
probability, is equal to 20.02, corresponding to marker RP129 on chromosome 12
(table 10.14).
Significant distortions are also observed on some other markers around RM304 and
RP129, which may be caused by the linkage effect, and therefore are not given in
table 10.14.


TAB. 10.14 – Segregation distortion loci identified in an F2 population in rice.


Marker   Chromosome   Sample size       M:m    -log10(P)   Fitness
                      MM    Mm    mm                       MM     Mm     mm

RP178    2            72    33    27    2.03   13.83       1.00   0.23   0.38
RM143    3            85    64    27    1.98   11.14       1.00   0.38   0.32
RM159    5            32    33    79    0.51   15.84       0.41   0.21   1.00
RM44     8            64    83    22    1.66    4.54       1.00   0.65   0.34
RM304    10           75    78    25    1.78    6.69       1.00   0.52   0.33
RM147    10           39    27    78    0.57   16.80       0.50   0.17   1.00
RM552    11           57    110   13    1.65    6.60       1.00   0.96   0.23
RP129    12           92    87    1     3.04   20.02       1.00   0.47   0.01
RM491    12           27    28    60    0.55   10.69       0.45   0.23   1.00


Fitness is an important concept and parameter in investigating the effects of
selection in population genetics (Wang, 2017; Hartl and Clark, 2007; Falconer and
Mackay, 1996). Fitness takes a value from 0 to 1, representing the relative
productivity of individuals with a given genotype in one genetic population. Assume
two genotypes AA and aa each have 100 infant individuals in a population, but only
10 individuals of AA and 9 individuals of aa survive to adult age and then
contribute to the next generation by random mating. Genotype AA has a higher
chance to pass its genes to the next generation, and its fitness is defined to be equal
to 1. In comparison, the fitness of genotype aa is equal to 0.9. In one F2 population, if
the three genotypes AA, Aa, and aa have equal fitness, i.e., 1, no selection occurs at
the locus, and the three genotypic frequencies would be equal to 0.25, 0.5, and 0.25,
respectively. Otherwise, the three genotypes will deviate from the expected ratio of
1:2:1. Therefore, the fitness values of the three genotypes can be estimated by
comparing the observed genotypic frequencies with their expected values.

Assume that allele M comes from parent PA64s, and allele m comes from parent
Nipponbare. Let $n_{MM}$, $n_{Mm}$, and $n_{mm}$ represent the sample sizes of genotypes MM,
Mm, and mm, respectively, and let $n_{max}$ be the maximum of $n_{MM}$, $\tfrac{1}{2}n_{Mm}$, and
$n_{mm}$. Then, the fitness values of the three genotypes are equal to $n_{MM}/n_{max}$,
$\tfrac{1}{2}n_{Mm}/n_{max}$, and $n_{mm}/n_{max}$, respectively. Given in the last three columns in
table 10.14 are the estimated fitness values at each identified SDM. At some distorted
markers, the parental genotype of PA64s has the highest fitness, such as RP178 and
RM143; at other distorted markers, the parental genotype of Nipponbare has the
highest fitness, such as RM159 and RM147. There is no locus in table 10.14 where the
heterozygote has the highest fitness value.
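
The columns of table 10.14 can be recomputed from the three genotype counts. The sketch below is an illustration with a hypothetical function name. It assumes that the distortion test is a chi-square goodness-of-fit test against the 1:2:1 ratio with two degrees of freedom, for which the tail probability is exp(-chi2/2); under this assumption the computed values agree with those tabulated for RP178 and RP129. The heterozygote count is halved before fitness is scaled, because heterozygotes are expected twice as often as each homozygote.

```python
import math


def distortion_summary(n_MM, n_Mm, n_mm):
    """Return (M:m allele ratio, -log10(P), (fitness MM, Mm, mm)) for one marker."""
    n = n_MM + n_Mm + n_mm
    observed = (n_MM, n_Mm, n_mm)
    expected = (0.25 * n, 0.5 * n, 0.25 * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 2 degrees of freedom, P = exp(-chi2 / 2), so -log10(P) = chi2 / (2 ln 10).
    neg_log10_p = chi2 / (2 * math.log(10))
    allele_ratio = (2 * n_MM + n_Mm) / (2 * n_mm + n_Mm)
    scaled = (n_MM, n_Mm / 2, n_mm)       # heterozygotes are expected twice as often
    n_max = max(scaled)
    fitness = tuple(x / n_max for x in scaled)
    return allele_ratio, neg_log10_p, fitness


ratio, score, fitness = distortion_summary(72, 33, 27)   # marker RP178
print(round(ratio, 2), round(score, 2), [round(w, 2) for w in fitness])
```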


10.5.2 Effects of Segregation Distortion on QTL Mapping 
in Populations with Three Genotypes at Each Locus 


Assume that a and d are the additive and dominant effects at one QTL, and $f_{QQ}$,
$f_{Qq}$, and $f_{qq}$ are the frequencies of the three QTL genotypes QQ, Qq, and qq,
respectively. When there is no distorted locus linked with this QTL, $f_{QQ}$, $f_{Qq}$, and
$f_{qq}$ are expected to be equal to 0.25, 0.5, and 0.25, respectively, in an F2
population, and the genetic variance caused by the QTL is given in equation 10.11.

$$\sigma_G^2 = \tfrac{1}{2}a^2 + \tfrac{1}{4}d^2 \qquad (10.11)$$

Let $f_{MM}$, $f_{Mm}$, and $f_{mm}$ be the frequencies of the three marker types MM, Mm, and
mm at one SDM, with the recombination frequency r between SDM and QTL. Due
to linkage, distortion at the marker locus will cause distortion at the QTL as
well. Thus, $f_{QQ}$, $f_{Qq}$, and $f_{qq}$ will also deviate from the expected frequencies of
0.25, 0.5, and 0.25. The degree of the deviation depends on the recombination
frequency between SDM and QTL, and the frequencies of the three marker types at
the distorted locus (table 10.15). The overall mean of genotypic values and the
genetic variance under distortion can be calculated by equation 10.12.


$$\mu = m + a(f_{QQ} - f_{qq}) + f_{Qq}d$$
$$\sigma_{SD}^2 = [f_{QQ} + f_{qq} - (f_{QQ} - f_{qq})^2]a^2 - 2f_{Qq}(f_{QQ} - f_{qq})ad + (f_{Qq} - f_{Qq}^2)d^2 \qquad (10.12)$$


Let the degree of dominance at the QTL be s = d/a. From equations 10.11 and
10.12, the ratio (k) of the variance under distortion ($\sigma_{SD}^2$) to the variance with no
distortion ($\sigma_G^2$) can be calculated by equation 10.13. It can be seen from
equation 10.13 that ratio k depends on the degree of dominance and the QTL
genotypic frequencies.


$$k = \frac{\sigma_{SD}^2}{\sigma_G^2} = \frac{4[f_{QQ} + f_{qq} - (f_{QQ} - f_{qq})^2] - 8f_{Qq}(f_{QQ} - f_{qq})s + 4(f_{Qq} - f_{Qq}^2)s^2}{2 + s^2} \qquad (10.13)$$


TAB. 10.15 – Genotypic frequencies in the F2 population under non-distortion and distortion.


Genotype Frequency with no distortion Frequency with distortion 
In statistics, the larger the variance caused by one factor with a number of levels
or treatments, the more easily the difference between the treatments can be detected, and
the higher the detection power will be. The same is true in QTL mapping. When one
QTL causes a larger genetic variance in the population, it results in a
higher LOD score and is therefore easier to detect. Therefore, the ratio
calculated by equation 10.13 can be used to approximately quantify the effect of one
SDM on QTL mapping. That is to say, when k > 1, the distorted marker will
increase the QTL detection power; when k < 1, the distorted marker will decrease
the QTL detection power; otherwise, the distorted marker will not change the QTL
detection power (Zhang et al., 2010).

When $a = 0$, we have $\sigma_{GD}^2 = (f_{Qq} - f_{Qq}^2)d^2$, $\sigma_G^2 = \frac{1}{4}d^2$,
and $k = 4(f_{Qq} - f_{Qq}^2)$. It can be easily seen that the maximum value $k = 1$ is
achieved when $f_{Qq} = 0.5$. That is to say, when the sample size of genotype Qq is equal
to the combined sample size of genotypes QQ and qq, QTL detection power would be the
same as that of non-distortion; otherwise, the linked SDM would reduce the QTL
detection power. If $d = 0$, or equivalently $s = 0$, we have
$k = 2[f_{QQ} + f_{qq} - (f_{QQ} - f_{qq})^2]$. It can be proved that the maximum value
$k = 2$ is achieved when $f_{Qq} = 0$ and $f_{QQ} = f_{qq} = 0.5$. These are the genotypic
frequencies in F1-derived DH and RIL populations. In other words, the F1-derived DH
and RIL populations without distortion have the greatest power in detecting the
additive QTL.
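
The ratio in equation 10.13 is easy to evaluate numerically. The following Python sketch is only an illustration; the function name `variance_ratio` and the third set of frequencies are assumptions made for the example and do not come from the book or from any software. It computes the variance under distortion from equation 10.12 and divides it by the Mendelian variance of equation 10.11, fixing $a = 1$ so that $d = s$.

```python
def variance_ratio(f_QQ, f_Qq, f_qq, s):
    """Ratio k of QTL variance under distortion to that with no distortion (F2).

    s is the degree of dominance d/a. The additive effect a cancels out of
    the ratio, so it is fixed at 1 here (this parameterization needs a != 0).
    """
    if abs(f_QQ + f_Qq + f_qq - 1.0) > 1e-9:
        raise ValueError("genotypic frequencies must sum to one")
    a, d = 1.0, s
    # Equation 10.12: genetic variance at the given genotypic frequencies
    var_distorted = ((f_QQ + f_qq - (f_QQ - f_qq) ** 2) * a ** 2
                     - 2.0 * f_Qq * (f_QQ - f_qq) * a * d
                     + (f_Qq - f_Qq ** 2) * d ** 2)
    # Equation 10.11: genetic variance at the Mendelian ratio 1:2:1
    var_mendelian = 0.5 * a ** 2 + 0.25 * d ** 2
    return var_distorted / var_mendelian


# Mendelian F2 frequencies: no change in variance
print(variance_ratio(0.25, 0.5, 0.25, s=0.0))
# No heterozygotes and equal homozygotes (DH- or RIL-like), additive QTL
print(variance_ratio(0.5, 0.0, 0.5, s=0.0))
# A hypothetical distorted locus with partial dominance
print(variance_ratio(0.4, 0.4, 0.2, s=0.5))
```

The first two calls reproduce the special cases discussed above: the ratio equals 1 at the Mendelian frequencies and reaches its maximum of 2 for a purely additive QTL when only the two homozygotes are present in equal proportions.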

Figure 10.12 shows the contour distribution of ratio k by the change in genotypic
frequencies for four degrees of dominance, i.e., s = 0, 0.5, 1, and 2. It can be seen
that for any degree of dominance, k can be higher than 1, lower than 1, or equal to 1,
depending on the three genotypic frequencies. When the genotypic frequencies
(summing up to one) are assumed to follow the uniform distribution, k is greater
than 1 with probabilities of about 47%, 51%, 50%, and 29% for the four degrees of
dominance, respectively (figure 10.12), which are estimated from the size of the area
where k > 1 in the contour distribution. The effect is negative if the distortion
reduces the genetic variance of the QTL. However, the effect can be positive as well
if the distortion happens to increase the genetic variance of the QTL (Zhang et al.,
2010; Xu, 2008).
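
The probabilities quoted above can be approximated by simulation. The sketch below is a rough Monte Carlo check under the stated assumption that the three genotypic frequencies are uniformly distributed over all triples summing to one; the function names, the number of draws, and the seed are choices made for this illustration only. Ratio k is evaluated with equation 10.13.

```python
import random


def variance_ratio(f_QQ, f_Qq, f_qq, s):
    """Equation 10.13: ratio k for degree of dominance s = d/a."""
    numerator = (4.0 * (f_QQ + f_qq - (f_QQ - f_qq) ** 2)
                 - 8.0 * f_Qq * (f_QQ - f_qq) * s
                 + 4.0 * (f_Qq - f_Qq ** 2) * s ** 2)
    return numerator / (2.0 + s ** 2)


def prob_k_above_one(s, n_draws=100_000, seed=1):
    """Monte Carlo estimate of Prob(k > 1) for uniformly drawn frequencies."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # Two sorted uniform cut points split [0, 1] into three pieces that
        # are uniformly distributed over the set f_QQ + f_Qq + f_qq = 1.
        u, v = sorted((rng.random(), rng.random()))
        if variance_ratio(u, v - u, 1.0 - v, s) > 1.0:
            hits += 1
    return hits / n_draws


for s in (0.0, 0.5, 1.0, 2.0):
    print(f"s = {s}: Prob(k > 1) is about {prob_k_above_one(s):.3f}")
```

Because the estimates are based on random draws, they fluctuate slightly with the seed and the number of draws, so they should be compared with the rounded percentages in the text rather than treated as exact values.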


FIG. 10.12 – Ratio of genetic variance at a locus with distorted genotypic frequencies in
comparison with no distortion in F2 populations. Notes: Assume the three genotypic
frequencies can take any values between 0 and 1 randomly, but add up to one. The four
degrees of dominance represent no dominance, partial dominance, complete dominance,
and over dominance.
FIG. 10.13 – Effects of segregation distortion on QTL detection in F2 populations. Notes:
(A) Population size 180; (B) Population size 500. The nine distorted markers are indicated in
table 10.13, and no segregation distortion is used as the check (i.e., the black bar on the
leftmost side of each QTL). Positions and genetic effects of the nine plant height QTLs are
indicated in table 10.12.


Assume that QTLs on most phenotypic traits are different from SDL or SDM.
Distortions in genotypic frequencies observed at QTLs are caused by the linked SDL
or SDM. Therefore, the effect of segregation distortion on QTL detection depends on
the following factors: degree of dominance of the QTL to be detected, fitness values
of the linked SDL, and linkage distance between SDL and QTL. To further illustrate
the effect of distortion, F2 populations are simulated by using the linkage map
constructed from the actual F2 population, nine QTLs on plant height as given in
table 10.12, and nine SDMs as given in table 10.13. Each SDM is assumed to be
located at the nearest left marker linked with each plant height QTL, and the other
136 markers are assumed to follow the Mendelian segregation ratio. Non-distortion
is used as the check. QTL mapping is conducted in the simulated populations, and
then the detection powers are counted and shown in figure 10.13. Readers are
encouraged to refer to Zhang et al. (2010) for detailed information on how the
simulation was conducted.

When the size of simulated F2 populations is equal to 180, the detection powers of
qPH1-1, qPH5, and qPH7 are similar to or lower than those of non-distortion
(figure 10.13A). Taking qPH7 as an example, each distorted marker is assumed to be
located at RM82 in the simulation, which is closely linked with qPH7. In comparison
with non-distortion, the detection power of qPH7 changes by -13.7%, -4.2%,
0.6%, -4.0%, -1.6%, -13.7%, -5.3%, -5.1%, and -7.3% for the nine distorted
markers, respectively. However, for other QTLs such as qPH3-1, qPH3-2, qPH4, and
qPH5, some SDMs resulted in increased detection power, but others resulted in
reduced detection power (figure 10.13A). For example, for qPH3-2, the deviations in
detection power from non-distortion are 0.7%, 0.6%, 9.2%, -8.3%,
-4.5%, 8.8%, -14.1%, -27.0%, and 8.8% for the nine SDMs, respectively. Distorted
markers RM44, RM304, RM552, and RP129 have lower detection power than
non-distortion; other distorted markers have similar or even higher
powers. For size 500, the impact of SDM is not as obvious as that observed for size
180. Similar trends can be observed only for some QTLs on plant height, such as
qPH4, qPH6, and qPH12 (figure 10.13B). The differences in detection powers
between the nine SDMs and non-distortion are also smaller than those
observed for size 180. Therefore, the impact of SDM may be reduced in populations
with large sizes.

As mentioned earlier, the effect of segregation distortion on QTL mapping can be
quantified by the ratio ($k$) of variances $\sigma_{GD}^2$ and $\sigma_G^2$ caused by one
QTL under distortion and non-distortion (equation 10.13). Theoretically, SDM will not
affect the QTL detection when $k = 1$; if $\sigma_{GD}^2$ is higher than $\sigma_G^2$,
i.e., $k > 1$, SDM will benefit QTL mapping; if $k < 1$, SDM will reduce the detection
power. When compared with non-distortion, most changes in the detection power as
observed in figure 10.13 are consistent with the change of genetic variance when SDM
is present (Zhang et al., 2010).


10.5.3 Genetic Distance That Can Be Affected
by Segregation Distortion


How far can one QTL be affected by one linked SDM? The four most distorted
markers in table 10.13, i.e., RP178, RM159, RM552, and RP129, are used here to
show the change of ratio k by the distance between the distorted marker and QTL
(figure 10.14). Ratio k can be either much higher than one or much lower than one
when the QTL is closely linked with one SDM. However, as the distance
between SDM and QTL increases, ratio k approaches one for the four SDMs and nine QTLs
on plant height, indicating the reduced effects from the linked SDLs. From
figure 10.14, it can be roughly seen that ratio k takes values between 0.8 and 1.2
when the linkage distance is over 20 cM. If the effect of SDM can be ignored for ratio
k in that range, the genetic distance affected by segregation distortion in QTL
mapping can be treated as 20 cM (Zhang et al., 2010).
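
The decay of the distortion effect with distance can be sketched numerically. The Python example below rests on several assumptions that are not taken from the text: the QTL genotypic frequencies are derived from the marker frequencies with the standard F2 conditional probabilities for two loci in coupling phase (presumably what table 10.15 tabulates), map distance is converted to recombination frequency with the Haldane function, and the marker frequencies are hypothetical. The function names are illustrative only.

```python
import math


def haldane_r(cm):
    """Recombination frequency from map distance in cM (Haldane function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * cm / 100.0))


def qtl_frequencies(f_MM, f_Mm, f_mm, r):
    """QTL genotypic frequencies implied by marker frequencies in an F2.

    Assumes marker allele M and QTL allele Q come from the same parent and
    that distortion acts only through the marker (or a locus at the marker).
    """
    nr = 1.0 - r
    f_QQ = nr * nr * f_MM + r * nr * f_Mm + r * r * f_mm
    f_Qq = 2.0 * r * nr * f_MM + (nr * nr + r * r) * f_Mm + 2.0 * r * nr * f_mm
    f_qq = r * r * f_MM + r * nr * f_Mm + nr * nr * f_mm
    return f_QQ, f_Qq, f_qq


def variance_ratio(f_QQ, f_Qq, f_qq, s):
    """Equation 10.13: ratio k for degree of dominance s = d/a."""
    numerator = (4.0 * (f_QQ + f_qq - (f_QQ - f_qq) ** 2)
                 - 8.0 * f_Qq * (f_QQ - f_qq) * s
                 + 4.0 * (f_Qq - f_Qq ** 2) * s ** 2)
    return numerator / (2.0 + s ** 2)


# Hypothetical frequencies of MM, Mm, and mm at a distorted marker
marker = (0.50, 0.40, 0.10)
for cm in (0, 5, 10, 20, 30, 50):
    freqs = qtl_frequencies(*marker, haldane_r(cm))
    print(f"{cm:2d} cM: k = {variance_ratio(*freqs, s=0.0):.3f}")
```

At zero distance the QTL simply inherits the marker frequencies, and at a recombination frequency of 0.5 the QTL frequencies return to 0.25, 0.5, and 0.25, so that k equals one regardless of how distorted the marker is.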


FIG. 10.14 – Ratio of genetic variance at the linked QTLs (e.g., the nine plant height QTLs)
by the increased map distance with four distorted markers (e.g., RP178, RM159, RM147, and
RP129) in the rice F2 population. Notes: The horizontal axis is the distance in cM (0 to 50)
between the distorted marker and QTL; the vertical axis is the ratio of genetic variance.

In conclusion, segregation distortion affects QTL detection when QTLs and segregation
distortion markers or loci are closely linked. Sometimes it makes the QTL
detection easier, and sometimes it reduces the QTL detection power, depending on
the change of genetic variance explained by the QTL. Segregation distortion does
not produce more false QTLs; neither does it have a significant impact on the
estimation of QTL position and effect. In practice, if the distortion is not extremely
serious, the effect of distortion can be ignored in large-sized mapping populations.


10.5.4 Effects of Segregation Distortion on QTL Mapping 
in Populations with Two Genotypes at Each Locus 


In populations consisting of two genotypes, the effect of distortion becomes much
simpler. Let the frequencies of the two QTL genotypes be $p$ and $1 - p$ in an actual
population, and the expected frequencies under non-distortion be $f$ and $1 - f$.
Therefore,

$$k = \frac{\sigma_{GD}^2}{\sigma_G^2} = \frac{p(1-p)}{f(1-f)} = \frac{1 - (1-2p)^2}{1 - (1-2f)^2}$$

When two genotypes have the expected ratio 1:1 (i.e., f = 0.5), such as F1-derived
RIL or F1-derived DH lines, any distortion will make ratio k smaller than 1, and
therefore reduce the QTL detection power. When two genotypes have other expected
ratios, such as 3:1 (i.e., f = 0.75) in backcross-derived RIL or backcross-derived DH
lines, the distortion could result in k values higher than one and therefore increase the
detection power. Figure 10.15 shows the ratio of genetic variances in BC1-derived and
F1-derived DH populations. Obviously, in F1-derived DH populations, ratio k cannot
be higher than one. In BC1-derived DH populations, ratio k is higher than one when
the frequency of genotype QQ is between 0.25 and 0.75. Any distortion that moves the
genotypic frequency into that range will increase the genetic variance at the QTL, and
the detection power as well.
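
For two genotypes, the ratio reduces to a one-line calculation. The short Python sketch below (the function name and the grid of frequencies are chosen for illustration only) evaluates k = p(1 - p) / [f(1 - f)] for the two expected ratios discussed above.

```python
def variance_ratio_two_genotypes(p, f):
    """Ratio k for a population with two genotypes at each locus.

    p is the actual frequency of one QTL genotype (e.g., QQ) and f is its
    expected frequency without distortion.
    """
    if not 0.0 < f < 1.0:
        raise ValueError("expected frequency f must lie strictly between 0 and 1")
    return (p * (1.0 - p)) / (f * (1.0 - f))


print(" p     k (f = 0.5)   k (f = 0.75)")
for p in (0.1, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9):
    k_f1 = variance_ratio_two_genotypes(p, 0.5)    # expected ratio 1:1
    k_bc1 = variance_ratio_two_genotypes(p, 0.75)  # expected ratio 3:1
    print(f"{p:4.2f}   {k_f1:10.3f}   {k_bc1:11.3f}")
```

With f = 0.5 the ratio never exceeds one, whereas with f = 0.75 it equals one at p = 0.25 and p = 0.75 and reaches its maximum of 4/3 at p = 0.5.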
FIG. 10.15 – Ratio of genetic variance of the distorted QTL in comparison with no
segregation distortion in BC1-derived and F1-derived DH populations.




10.6 Non-Normality of the Phenotypic Distribution 


10.6.1 Phenotypic Model and Distribution of Quantitative 
Traits 


For most QTL mapping methods, error effects included in phenotypic values have to 
be assumed to follow the normal distribution with a mean of zero. The phenotypic 
performance of one quantitative trait approaches a normal distribution only under 
the multi-factorial hypothesis. Non-normality of phenotypic distribution does not 
affect QTL mapping (Li et al., 2010). Based on the quantitative genetics theory 
(Wang, 2017), phenotypic observation (P) of an individual or one line in a given 
environment is the summation of its genotypic value (G) and one random error effect 
(e), which is actually the major content of pure line theory. Therefore, phenotypic 
observation can be expressed by a linear model, 


P=G+e (10.14) 


where the error effect follows the normal distribution with mean 0 and a given
variance $\sigma_e^2$, i.e., $e \sim N(0, \sigma_e^2)$. As equation 10.14 shows, the
replicated observations from one specific genotype follow a normal distribution as
well, i.e., $P \sim N(G, \sigma_e^2)$, which is also the basic assumption for most
statistical methods.

In one genetic population, a number of genotypes are included. In most cases, 
each individual or line included in the population has one unique genotype. The 
segregation in genotypes is the prerequisite for the population to be useful in genetic 
studies. Different genotypes have different genotypic values, and therefore their 
phenotypic observations follow different normal distributions, even though the error 
effects can be assumed to follow the same normal distribution N(0, o?). In fact, the 
phenotypic distribution of the genetic population is a mixture of a number of normal 
distributions. Under the additive and dominant model including q QTLs,
genotypic values can be represented by equation 10.15.


$$G = m + \sum_{j=1}^{q} (a_j u_j + d_j v_j) \qquad (10.15)$$


where $m$ is the overall mean of the population consisting of homozygous genotypes
only (or the reference population), and $u_j$ and $v_j$ are genotypic indicators for the
jth QTL with values 1 and 0 for genotype QQ, values 0 and 1 for genotype Qq, and values
-1 and 0 for genotype qq (see also §5.3 in chapter 5).
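
Equation 10.15 can be illustrated by a small Python function. The sketch below is an illustration only; the function name, the two-QTL genotypes, and the effect values are hypothetical. It maps each QTL genotype to its pair of indicators and sums the additive and dominant contributions.

```python
# Indicators (u_j, v_j) for the three genotypes at one QTL, as defined above
INDICATORS = {"QQ": (1, 0), "Qq": (0, 1), "qq": (-1, 0)}


def genotypic_value(genotypes, additive, dominant, m):
    """Genotypic value G under the additive and dominant model (equation 10.15).

    genotypes: one of "QQ", "Qq", or "qq" for each QTL;
    additive, dominant: effects a_j and d_j, in the same order as genotypes;
    m: mean of the population consisting of homozygous genotypes only.
    """
    if not (len(genotypes) == len(additive) == len(dominant)):
        raise ValueError("genotypes and effects must have the same length")
    g = m
    for genotype, a_j, d_j in zip(genotypes, additive, dominant):
        u_j, v_j = INDICATORS[genotype]
        g += a_j * u_j + d_j * v_j
    return g


# Hypothetical two-QTL example
a = [1.0, 0.5]
d = [0.2, 0.0]
print(genotypic_value(["QQ", "Qq"], a, d, m=10.0))
print(genotypic_value(["Qq", "qq"], a, d, m=10.0))
```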

Different populations have different allelic and genotypic frequencies. In
equation 10.15, different genotypes have different G values with different frequencies in
the genetic population. Under the multi-factorial hypothesis, different G values
approach a normal distribution. Therefore, after the modification by random errors,
phenotypic observations also approach a normal distribution. However, when the
multi-factorial hypothesis is not true, for example when there are only a few QTLs and
one or two of them have relatively large genetic effects, G values calculated from
equation 10.15 can be far from a normal distribution. In this situation,
phenotypic observations given by equation 10.14 can be far from a normal distribution
as well, even after the modification of normally distributed random errors.

An example helps to demonstrate the non-normality of phenotypic
distribution in genetic populations. Assume one QTL is located at 25 cM on one
chromosome of 160 cM in length. Additive and dominant effects of the QTL are
equal to 1 and 0, respectively, and other genetic effects are not considered. The
population mean is equal to 10, and the error variance is equal to 0.2. In the DH
population, genotypes qq and QQ both have a frequency of 0.5; their phenotypic
observations follow normal distributions N(9, 0.2) and N(11, 0.2), respectively.
Genetic variance $V_G$ = 1, phenotypic variance $V_P$ = 1.2, and the broad-sense
heritability $H^2$ = 83.3%. The theoretical distribution is a mixture of two normal
distributions N(9, 0.2) and N(11, 0.2), each with a frequency of 0.5 (figure 10.16A). Two
peaks are present in the mixture distribution, which cannot be a normal distribution
at all. In the F2 population, three genotypes qq, Qq, and QQ have frequencies of
0.25, 0.5, and 0.25. Genetic variance $V_G$ = 0.5, phenotypic variance $V_P$ = 0.7, and
the broad-sense heritability $H^2$ = 71.4%. The theoretical distribution is a mixture of
three normal distributions, i.e., N(9, 0.2), N(10, 0.2), and N(11, 0.2), with
frequencies of 0.25, 0.5, and 0.25, respectively (figure 10.16B). Though only one peak is
present in the mixture distribution, it is still far from a normal distribution.
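
The variances and heritabilities in this example, and the number of peaks in each mixture, can be checked with a few lines of Python. The sketch below is only an illustration; the function names are assumptions made for the example. Each population is described as a list of (frequency, genotypic value) pairs that share the error variance 0.2.

```python
import math


def mixture_summary(components, error_var):
    """Mean, genetic variance, phenotypic variance, and broad-sense heritability.

    components: list of (frequency, genotypic value) pairs summing to one.
    """
    mean = sum(w * g for w, g in components)
    v_g = sum(w * (g - mean) ** 2 for w, g in components)
    v_p = v_g + error_var
    return mean, v_g, v_p, v_g / v_p


def mixture_density(x, components, error_var):
    """Density at x of the mixture of normal distributions N(G, error_var)."""
    sd = math.sqrt(error_var)
    return sum(w * math.exp(-0.5 * ((x - g) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
               for w, g in components)


DH = [(0.5, 9.0), (0.5, 11.0)]                 # genotypes qq and QQ
F2 = [(0.25, 9.0), (0.5, 10.0), (0.25, 11.0)]  # genotypes qq, Qq, and QQ

for name, population in (("DH", DH), ("F2", F2)):
    mean, v_g, v_p, h2 = mixture_summary(population, error_var=0.2)
    print(f"{name}: mean = {mean:.1f}, VG = {v_g:.2f}, VP = {v_p:.2f}, H2 = {100 * h2:.1f}%")
    print("   density at 9, 10, 11:",
          [round(mixture_density(x, population, 0.2), 3) for x in (9.0, 10.0, 11.0)])
```

In the DH population the density at 10 is lower than at 9 and 11, which gives the two peaks, while in the F2 population the density is highest at 10.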
FIG. 10.16 – Theoretical distributions of the genotypic components in DH (A) and F2
(B) populations (i.e., the dotted lines), and theoretical distributions of the entire population
(i.e., the solid lines).


10.6.2 QTL Mapping on Phenotypic Traits 
of the Non-Normal Distributions 


Using the one-QTL model from the previous example, one mapping population with
200 DH lines is simulated by QTL IciMapping software. Phenotypic values range
from 8 to 12. Most DH lines are concentrated around values 9 and 11, so the
phenotypic values are not normally distributed (figure 10.17A). The sample mean and
variance of the simulated population are equal to 10.04 and 1.15, close to the
population mean of 10 and population variance of 1.2, respectively. On the LOD score
profile from one-dimensional scanning of ICIM, one clear peak occurs near position
25 cM with a LOD score of 92.01 (figure 10.17B). The additive effect at the peak is
0.9867, and the phenotypic variance explained is 81.12%, close to the theoretical
effect (i.e., 1.0) and broad-sense heritability (i.e., 83.3%) (figure 10.17C).

FIG. 10.17 – QTL mapping on a non-normally distributed phenotypic trait in one simulated
DH population. Notes: (A) Observed phenotypic distribution; (B) LOD score profile in QTL
mapping; (C) Additive effect profile in QTL mapping. One QTL is located at 25 cM on the
chromosome with additive effect 1.0, and dominant effect 0.0. Error variance is equal to 0.2.
The simulated population has 200 F1-derived DH lines.

Using the same QTL model, one mapping population with 200 F2 individuals is
simulated by QTL IciMapping software. Phenotypic values range from 8.0
to 12.0, but more individuals are concentrated around the value 10.0. The
distribution has two long tails and is not normal (figure 10.18A).
The sample mean and variance of the simulated population are equal to 10.03 and
0.73, close to the population mean (i.e., 10.0) and population variance (i.e., 0.7),
respectively. On the LOD score profile from the one-dimensional scanning of
ICIM, one clear peak occurs near position 26 cM with a LOD score of 53.91
(figure 10.18B). Additive and dominant effects at the peak are 0.9269 and 0.0632,
close to the true values of 1.0 and 0.0, respectively (figure 10.18C). The phenotypic
variance explained at the peak position is 68.46%, close to the theoretical broad-sense
heritability of 71.4%.
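
The behavior of the two simulated populations can be imitated with a much simpler simulation. The Python sketch below is not the procedure used by QTL IciMapping; it is a simplified illustration that draws the QTL genotype of each individual directly (no markers and no interval mapping), adds a normal error with variance 0.2, and reports the proportion of phenotypic variance explained by the QTL genotype. The function names and the seed are assumptions made for the example.

```python
import random
import statistics


def simulate(freqs, n, a=1.0, d=0.0, m=10.0, error_var=0.2, seed=2024):
    """Draw QTL genotypes with the given frequencies and simulate phenotypes."""
    rng = random.Random(seed)
    values = {"QQ": m + a, "Qq": m + d, "qq": m - a}
    genos = rng.choices(list(freqs), weights=list(freqs.values()), k=n)
    phenos = [values[g] + rng.gauss(0.0, error_var ** 0.5) for g in genos]
    return genos, phenos


def variance_explained(genos, phenos):
    """Proportion of the total sum of squares explained by QTL genotype."""
    grand_mean = statistics.fmean(phenos)
    total_ss = sum((p - grand_mean) ** 2 for p in phenos)
    within_ss = 0.0
    for g in sorted(set(genos)):
        group = [p for gg, p in zip(genos, phenos) if gg == g]
        group_mean = statistics.fmean(group)
        within_ss += sum((p - group_mean) ** 2 for p in group)
    return 1.0 - within_ss / total_ss


populations = {
    "DH": {"qq": 0.5, "QQ": 0.5},
    "F2": {"qq": 0.25, "Qq": 0.5, "QQ": 0.25},
}
for name, freqs in populations.items():
    genos, phenos = simulate(freqs, n=200)
    print(f"{name}: mean = {statistics.fmean(phenos):.2f}, "
          f"variance = {statistics.variance(phenos):.2f}, "
          f"variance explained = {100 * variance_explained(genos, phenos):.1f}%")
```

The exact numbers depend on the random seed, but the proportion of variance explained should fall near the theoretical heritabilities of 83.3% and 71.4%, even though neither phenotypic distribution is normal.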

FIG. 10.18 – QTL mapping on a non-normally distributed phenotypic trait in one simulated
F2 population. Notes: (A) Observed phenotypic distribution; (B) LOD score profile in QTL
mapping; (C) Additive and dominant effect profiles in QTL mapping. One true QTL is
located at 25 cM on the chromosome with additive effect 1.0, and dominant effect 0.0. Error
variance is equal to 0.2. The simulated population has 200 F2 individuals.

The two simulated populations (figures 10.17 and 10.18) show that normality of the
phenotypic values of a quantitative trait is not a pre-condition for conducting QTL
mapping studies. Non-normality does not affect QTL detection, nor the correct
estimation of QTL position and genetic effects. However, similar to most classical
statistical methods such as ANOVA, least square estimation, and maximum likelihood
estimation, random error effects included in phenotypic values have to be assumed to be
normally distributed. The normal distribution of random errors has been widely
accepted in practice. For genetic populations used in QTL mapping, if there are no
replicated observations, the normality test on random errors is impossible as the
error effects cannot be separated from the un-replicated phenotypic values. However,
when replicated observations are available, error effects can be estimated by the
deviation of each observation from the replicate mean, and then used in the normality
test. For each individual or line in the population, the replicate mean can be used to
estimate the genotypic value, i.e., $\hat{G} = \bar{P}$. Random error effects included in
each observation can be estimated by equation 10.16.


$$\hat{e} = P - \hat{G} = P - \bar{P} \qquad (10.16)$$


By equation 10.16, the random error included in each observation for each individual
or line can be acquired. As defined by equation 10.14, the error effects are required
to be independent and identically distributed (iid) normal variables, i.e.,
$e \sim N(0, \sigma_e^2)$ iid. Therefore, the normality test can be conducted by taking
all error effects estimated by equation 10.16 as one population. In fact, the normality
assumption made in any ANOVA linear model also refers to the random error effects, and
the normality test should be based on the residual effects included and estimated in the
linear model.
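
The procedure can be sketched in Python as follows. The text does not prescribe a particular normality test, so the Jarque-Bera statistic is used here purely as an assumed, convenient choice because it can be computed with the standard library; the function names and the simulated data are hypothetical. Residuals of the same line sum to zero and are therefore not fully independent, so the p-value is only approximate.

```python
import math
import random
import statistics


def residuals_from_replicates(observations):
    """Pool the estimated error effects of all lines (equation 10.16).

    observations: dict mapping a line name to its replicated phenotypic values.
    """
    residuals = []
    for replicates in observations.values():
        line_mean = statistics.fmean(replicates)  # estimate of genotypic value
        residuals.extend(p - line_mean for p in replicates)
    return residuals


def jarque_bera(x):
    """Jarque-Bera statistic and its asymptotic p-value (chi-square, 2 df)."""
    n = len(x)
    mean = statistics.fmean(x)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)  # survival function of chi-square with 2 df


# Hypothetical DH population: 200 lines, 3 replicates, one QTL with a = 1
rng = random.Random(7)
data = {}
for i in range(200):
    g = 10.0 + rng.choice((-1.0, 1.0))
    data[f"line{i + 1}"] = [g + rng.gauss(0.0, math.sqrt(0.2)) for _ in range(3)]

pooled = residuals_from_replicates(data)
jb, p_value = jarque_bera(pooled)
print(f"{len(pooled)} residuals, JB = {jb:.2f}, approximate p-value = {p_value:.3f}")
```

The test is applied to the pooled residuals rather than to the phenotypic values themselves, which in this simulated population follow a two-peak mixture and would fail any normality test even though the errors are normal.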


Exercises 


10.1 Assume there are three chromosomes, each 120 cM in length, with markers evenly
distributed every 10 cM. There are two independent QTLs affecting
one quantitative trait. One is located at 28 cM on the first chromosome with an
additive effect of 1, and the other is located at 41 cM on the second chromosome
with an additive effect of 0.5. There is no dominant effect at either QTL, and
other genetic factors are not considered. The population mean is set at 10, and the error
variance is set at 0.5. Genotypes of the two parental lines are Q1Q1Q2Q2 and q1q1q2q2.


1) Calculate genetic variance, phenotypic variance, and heritability in the
bi-parental DH population, and draw the theoretical phenotypic distribution of
the population.

2) Use the QTL IciMapping software to simulate one mapping population with
200 DH lines. Draw the frequency distribution of the simulated phenotypic
values, and conduct QTL mapping in the simulated DH population.

3) Calculate genetic variance, phenotypic variance, and heritability in the
bi-parental F2 population, and draw the theoretical phenotypic distribution of
the population.

4) Use the QTL IciMapping software to simulate one mapping population with
200 F2 individuals. Draw the frequency distribution of the simulated phenotypic
values, and conduct QTL mapping in the simulated F2 population.


10.2 Assume there are three chromosomes, each 120 cM in length, with markers evenly
distributed every 10 cM. There are two linked QTLs affecting one
quantitative trait, located at 28 cM and 57 cM on the first chromosome with
additive effects 1.0 and 0.5, respectively. There is no dominant effect at either QTL.
Other genetic factors are not considered. The population mean is set at 10, and the
error variance is set at 0.5. Genotypes of the two parental lines are Q1Q1Q2Q2 and
q1q1q2q2.


1) Calculate genetic variance, phenotypic variance, and heritability in the
bi-parental DH population, and draw the theoretical phenotypic distribution of
the population.

2) Use the QTL IciMapping software to simulate one mapping population with
200 DH lines. Draw the frequency distribution of the simulated phenotypic
values and conduct QTL mapping in the simulated DH population.

3) Calculate genetic variance, phenotypic variance, and heritability in the
bi-parental F2 population, and draw the theoretical phenotypic distribution of
the population.

4) Use the QTL IciMapping software to simulate one mapping population with
200 F2 individuals. Draw the frequency distribution of the simulated phenotypic
values and conduct QTL mapping in the simulated F2 population.


10.3 Refer to §10.3.2. Assume one chromosome is 160 cM in length and carries
two QTLs affecting a phenotypic trait at 22 cM and 42 cM, respectively. Additive
effects of the two QTLs are equal to 1 and -1, and heritability in the broad sense is
equal to 50%. Other genetic effects are not considered. Use the simulation
functionality in the software QTL IciMapping to investigate the effects of three marker
densities, i.e., 20 cM, 10 cM, and 5 cM, on QTL detection. Any bi-parental
population available in the software can be used.


10.4 Given in the first row (excluding the title) of the following table are the
observed numbers of individuals that are resistant (i.e., R) or susceptible (i.e., S) to
a disease in each of the three marker types (i.e., AA, Aa, and aa) in an F2
population. Following the sample sizes are the estimate of recombination frequency (r)
between one co-dominant marker and the dominant resistant gene, the standard error of
the estimate, and the LOD score in testing the linkage relationship. Use the 2pointREC
tool available in the software QTL IciMapping to verify the estimates, standard
errors, and LOD scores given in the other rows, each representing one scenario of
segregation distortion caused by selection on marker genotypes.


Type of data   Type AA       Type Aa       Type aa       Total   Estimate   Standard error    LOD
               R      S      R      S      R      S      size    of r       of the estimate   score

Complete       572    3      1161   22     14     569    2341    0.0179     0.0027            488.55
1/3 of AA      191    1      1161   22     14     569    1958    0.0163     0.0026            446.86
1/3 of Aa      572    3      387    7      14     569    1552    0.0172     0.0033            415.11
1/2 of aa      572    3      1161   22     7      285    2050    0.0198     0.0033            331.37
No AA          0      0      1161   22     14     569    1766    0.0155     0.0026            426.02
No Aa          572    3      0      0      14     569    1158    0.0169     0.0037            377.82
No aa          572    3      1161   22     0      0      1758    0.0237     0.0044            173.89
Quantitative Trait Loci by Environment Interactions

Simeng Zhang    MSc (2014-2017)    Linkage Analysis and QTL Mapping in Genetic Populations Consisting of Pure Lines Derived from Four-way Crosses    §8.1, §8.3

Jinhui Shi      MSc (2016-2019)    Algorithm and Applications of Inclusive Composite Interval Mapping in Pure-line Populations Derived from Eight-way Crosses    §8.2, §8.4

Pingping Qu     PhD (2018-2022)    Construction of Consensus Linkage Map with Applications in Gene Mapping in Wheat (Triticum aestivum L.) Using 90K SNP Array    §8.3, §8.4

Appendix C: Integrated Software 
Packages Making Up This Book 


Software: QTL IciMapping
Latest version: Version 4.2
Targeted populations: Twenty bi-parental populations derived from two homozygous
parents, NAM, CSSL
Functionalities:
(1) AOV: Analysis of variance for single- and multi-environmental phenotyping trials;
(2) SNP: Converting the SNP genotyping data to the software format;
(3) BIN: Binning of redundant markers;
(4) MAP: Construction of genetic linkage maps in biparental populations;
(5) CMP: Consensus map construction from multiple genetic linkage maps sharing common
markers;
(6) SDL: Mapping of segregation distortion loci in biparental populations;
(7) BIP: Mapping of additive and digenic epistasis genes in biparental populations;
(8) MET: QTL by environment interaction in biparental populations;
(9) CSL: Mapping of additive and digenic epistasis genes with chromosome segment
substitution lines;
(10) NAM: QTL mapping in nested association mapping populations.


Software: GACD
Latest version: Version 1.2
Targeted populations: F1 (or full-sib) family from two heterozygous parents, and double
cross F1 from four homozygous parents
Functionalities:
(1) SNP: Converting the SNP genotyping data to the software format;
(2) BIN: Binning of redundant markers;
(3) CDM: Construction of genetic linkage maps in single cross F1 of two heterozygous
parents and double cross F1 of four homozygous parents;
(4) CDQ: Gene detection in double cross F1 of four homozygous parents.


Software: GAPL
Latest version: Version 1.2
Targeted populations: Pure-line progeny populations from the inter-mating of 3-8
homozygous parents
Functionalities:
(1) SNP: Converting the SNP genotyping data to the software format;
(2) BIN: Binning of redundant markers;
(3) PLM: Construction of genetic linkage maps in multi-parental pure-line populations;
(4) PLQ: Gene detection in multi-parental pure-line populations.


